Professional Documents
Culture Documents
Version 2.1
Jennifer Aspesi
Oliver Shorey
Copyright © 2010 - 2012 EMC Corporation. All rights reserved.
EMC believes the information in this publication is accurate as of its publication date. The information is
subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO
REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS
PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR
FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an applicable
software license.
For the most up-to-date regulatory document for your product line, go to the Technical Documentation and
Advisories section on EMC Powerlink.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.
All other trademarks used herein are the property of their respective owners.
Preface
Chapter 7 Conclusion
Conclusion........................................................................................ 120
Better protection from storage-related failures ....................121
Protection from a larger array of possible failures...............121
Greater overall resource utilization........................................122
Glossary
Figures
1 Application and data mobility example ..................................................... 20
2 HA infrastructure example ........................................................................... 21
3 Distributed data collaboration example ..................................................... 22
4 VPLEX offerings ............................................................................................. 24
5 Architecture highlights.................................................................................. 26
6 VPLEX cluster example ................................................................................. 34
7 VPLEX Management Console ...................................................................... 42
8 Management Console welcome screen ....................................................... 43
9 VPLEX single engine configuration............................................................. 47
10 VPLEX dual engine configuration ............................................................... 48
11 VPLEX quad engine configuration .............................................................. 49
12 Port redundancy............................................................................................. 56
13 Director redundancy...................................................................................... 57
14 Engine redundancy ........................................................................................ 58
15 Site redundancy.............................................................................................. 59
16 High level functional sites in communication ........................................... 62
17 High level Site A failure ................................................................................ 63
18 High level Inter-site link failure ................................................................... 63
19 VPLEX active and functional between two sites ....................................... 64
20 VPLEX concept diagram with failure at Site A.......................................... 65
21 Correct resolution after volume failure at Site A....................................... 66
22 VPLEX active and functional between two sites ....................................... 67
23 Inter-site link failure and cluster partition ................................................. 68
24 Correct handling of cluster partition........................................................... 69
25 VPLEX static detach rule............................................................................... 71
26 Typical detach rule setup .............................................................................. 72
27 Non-preferred site failure ............................................................................. 73
28 Volume remains active at Cluster 1............................................................. 74
29 Typical detach rule setup before link failure ............................................. 75
30 Inter-site link failure and cluster partition ................................................. 76
Tables
1 Overview of VPLEX features and benefits .................................................. 26
2 Configurations at a glance ............................................................................. 35
3 Management server user accounts ............................................................... 40
Audience This document is part of the EMC VPLEX family documentation set,
and is intended for use by storage and system administrators.
Readers of this document are expected to be familiar with the
following topics:
◆ Storage area networks
◆ Storage virtualization technologies
◆ EMC Symmetrix, VNX series, and CLARiiON products
Authors This TechBook was authored by the following individuals from the
Enterprise Storage Division, VPLEX Business Unit based at EMC
headquarters, Hopkinton, Massachusetts.
Jennifer Aspesi has over 10 years of work experience with EMC in
Storage Area Networks (SAN), Wide Area Networks (WAN), and
Network and Storage Security technologies. Jen currently manages
the Corporate Systems Engineer team for the VPLEX Business Unit.
She earned her M.S. in Marketing and Technological Innovation from
Worcester Polytechnic Institute, Massachusetts.
Oliver Shorey has over 11 years of experience working within the
Business Continuity arena, seven of which have been with EMC
engineering, designing and documenting high-end replication and
geographically dispersed clustering technologies. He is currently a
Principal Corporate Systems Engineer in the VPLEX Business Unit.
Typographical conventions
EMC uses the following type style conventions in this document:
Normal Used in running (nonprocedural) text for:
• Names of interface elements (such as names of windows, dialog
boxes, buttons, fields, and menus)
• Names of resources, attributes, pools, Boolean expressions,
buttons, DQL statements, keywords, clauses, environment
variables, functions, utilities
• URLs, pathnames, filenames, directory names, computer
names, filenames, links, groups, service keys, file systems,
notifications
Bold Used in running (nonprocedural) text for:
• Names of commands, daemons, options, programs, processes,
services, applications, utilities, kernels, notifications, system
calls, man pages
Used in procedures for:
• Names of interface elements (such as names of windows, dialog
boxes, buttons, fields, and menus)
• What user specifically selects, clicks, presses, or types
Italic Used in all text (including procedures) for:
• Full titles of publications referenced in text
• Emphasis (for example a new term)
• Variables
Courier Used for:
• System output, such as an error message or script
• URLs, complete paths, filenames, prompts, and syntax when
shown outside of running text
Courier bold Used for:
• Specific user input (such as commands)
Courier italic Used in procedures for:
• Variables on command line
• User input variables
<> Angle brackets enclose parameter or variable values supplied by
the user
[] Square brackets enclose optional values
This chapter provides a brief summary of the main use cases for the
EMC VPLEX family and design considerations for high availability. It
also covers some of the key features of the VPLEX family system.
Topics include:
◆ Introduction ........................................................................................ 18
◆ VPLEX value overview ..................................................................... 19
◆ VPLEX product offerings ................................................................. 23
◆ Metro high availability design considerations .............................. 28
Introduction
The purpose of this TechBook is to introduce EMC® VPLEX™ high
availability and the VPLEX Witness as it is conceptually
architected, typically by customer storage administrators and EMC
Solutions Architects. The introduction of VPLEX Witness provides
customers with absolute physical and logical fabric and cache-coherent
redundancy when it is properly designed into the VPLEX Metro
environment.
This TechBook is designed to provide an overview of the features and
functionality associated with the VPLEX Metro configuration and the
importance of active/active data resiliency for today’s advanced host
applications.
VPLEX Local
VPLEX Local provides seamless, non-disruptive data mobility and the
ability to manage multiple heterogeneous arrays from a single
interface within a data center.
VPLEX Local allows increased availability, simplified management,
and improved utilization across multiple arrays.
Architecture highlights
VPLEX support is open and heterogeneous, supporting both EMC
storage and common arrays from other storage vendors, such as
HDS, HP, and IBM. VPLEX conforms to established worldwide
naming (WWN) guidelines that can be used for zoning.
VPLEX supports operating systems including both physical and
virtual server environments with VMware ESX and Microsoft
Hyper-V. VPLEX supports network fabrics from Brocade and Cisco,
including legacy McData SANs.
Note: For the latest information please refer to the ESSM (EMC
Simple Support Matrix) for supported host types as well as the
connectivity ESM for fabric and extended fabric support.
Features                         Benefits
Advanced data caching            Improve I/O performance and reduce storage array contention.
Scale-out cluster architecture   Start small and grow larger with predictable service levels.
Introduction
This section provides basic information on the following:
◆ “VPLEX I/O” on page 32
◆ “High-level VPLEX I/O flow” on page 32
◆ “Distributed coherent cache” on page 33
◆ “VPLEX family clustering architecture” on page 33
VPLEX I/O
VPLEX is built on a lightweight protocol that maintains cache
coherency for storage I/O and the VPLEX cluster provides highly
available cache, processing power, front-end, and back-end Fibre
Channel interfaces.
EMC hardware powers the VPLEX cluster design so that all devices
are always available and I/O that enters the cluster from anywhere
can be serviced by any node within the cluster.
The AccessAnywhere feature in the VPLEX Metro and VPLEX Geo
products extends the cache coherency between data centers at a
distance.
On reads from other engines, VPLEX checks the directory and tries to
pull the read I/O directly from the engine cache to avoid going to the
physical arrays to satisfy the read.
This model enables VPLEX to stretch the cluster as VPLEX distributes
the directory between clusters and sites. Due to the hierarchical
nature of the VPLEX directory, VPLEX is efficient, with minimal
overhead, and enables I/O communication over distance.
Upgrade paths
VPLEX facilitates application and storage upgrades without a service
window through its flexibility to shift production workloads
throughout the VPLEX environment.
In addition, high-availability features of the VPLEX cluster allow for
non-disruptive VPLEX hardware and software upgrades.
This flexibility means that VPLEX is always servicing I/O and never
has to be completely shut down.
Hardware upgrades
Upgrades are supported for single-engine VPLEX systems to dual- or
quad-engine systems.
A single VPLEX Local system can be reconfigured to work as a
VPLEX Metro or VPLEX Geo by adding a new remote VPLEX cluster.
Additionally, an entire VPLEX VS1 cluster (hardware) can be fully
upgraded to VS2 hardware non-disruptively.
Information for VPLEX hardware upgrades is in the Procedure
Generator that is available through EMC Powerlink.
Software upgrades
VPLEX features a robust non-disruptive upgrade (NDU) technology
to upgrade the software on VPLEX engines and VPLEX Witness
servers. Management server software must be upgraded before
running the NDU.
Due to the VPLEX distributed coherent cache, directors elsewhere in
the VPLEX installation service I/Os while the upgrade is taking
place. This alleviates the need for service windows and reduces RTO.
The NDU includes the following steps, summarized in the sketch after this list:
◆ Preparing the VPLEX system for the NDU
◆ Starting the NDU
◆ Transferring the I/O to an upgraded director
◆ Completing the NDU
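The following Python-style sketch (illustrative only, not the actual VPLEX NDU implementation; all names are hypothetical) shows the general shape of this sequence, with I/O transferred away from the director being upgraded so the system is always servicing I/O:

def run_ndu(directors):
    """Upgrade directors one at a time so the remaining directors keep servicing I/O."""
    print("Preparing the VPLEX system for the NDU")
    for director in directors:
        others = [d for d in directors if d != director]
        print(f"Transferring I/O from {director} to {others}")
        print(f"Upgrading software on {director}")
    print("Completing the NDU")

run_ndu(["director-1-1-A", "director-1-1-B", "director-1-2-A", "director-1-2-B"])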
Web-based GUI
VPLEX includes a Web-based graphical user interface (GUI) for
management. The EMC VPLEX Management Console Help provides
more information on using this interface.
To perform other VPLEX operations that are not available in the GUI,
refer to the CLI, which supports full functionality. The EMC VPLEX
CLI Guide provides a comprehensive list of VPLEX commands and
detailed instructions on using those commands.
The EMC VPLEX Management Console provides functions that
include, but are not limited to, the following:
◆ Storage array discovery and provisioning
◆ Local provisioning
◆ Distributed provisioning
◆ Mobility Central
◆ Online help
VPLEX CLI
VPlexcli is a command line interface (CLI) used to configure and operate
VPLEX systems. It also provides the EZ-Setup Wizard process to
make installation of VPLEX easier and quicker.
The CLI is divided into command contexts. Some commands are
accessible from all contexts, and are referred to as ‘global commands’.
The remaining commands are arranged in a hierarchical context tree
that can only be executed from the appropriate location in the context
tree.
Management console
The VPLEX Management Console provides a graphical user interface
(GUI) to manage the VPLEX cluster. The GUI can be used to
provision storage, as well as manage and monitor system
performance.
Figure 7 on page 42 shows the VPLEX Management Console window
with the cluster tree expanded to show the objects that are
manageable from the front-end, back-end, and the federated storage.
The VPLEX Management Console provides online help for all of its
available functions. Online help can be accessed in the following
ways:
◆ Click the Help icon in the upper right corner on the main screen
to open the online help system, or in a specific screen to open a
topic specific to the current task.
◆ Click the Help button on the task bar to display a list of links to
additional VPLEX documentation and other sources of
information.
For information about the VPlexcli, refer to the EMC VPLEX CLI
Guide.
System reporting
VPLEX system reporting software collects configuration information
from each cluster and each engine. The resulting configuration file
(XML) is zipped and stored locally on the management server or
presented to the SYR system at EMC via call home.
You can schedule a weekly job to automatically collect SYR data
(VPlexcli command scheduleSYR), or manually collect it whenever
needed (VPlexcli command syrcollect).
Director software
The director software provides:
◆ Basic Input/Output System (BIOS) — Provides low-level
hardware support to the operating system, and maintains boot
configuration.
◆ Power-On Self Test (POST) — Provides automated testing of
system hardware during power on.
◆ Linux — Provides basic operating system services to the VPlexcli
software stack running on the directors.
◆ VPLEX Power and Environmental Monitoring (ZPEM) —
Provides monitoring and reporting of system hardware status.
◆ EMC Common Object Model (ECOM) — Provides management
logic and interfaces to the internal components of the system.
◆ Log server — Collates log messages from director processes and
sends them to the SMS.
◆ EMC GeoSynchrony™ (I/O Stack) — Processes I/O from hosts,
performs all cache processing, replication, and virtualization
logic, interfaces with arrays for claiming and I/O.
Configuration overview
The VPLEX configurations are based on how many engines are in the
cabinet. The basic configurations are single, dual, and quad
(previously known as small, medium, and large).
The configuration sizes refer to the number of engines in the VPLEX
cabinet. The remainder of this section describes each configuration
size.
Dual configurations
The VPLEX dual engine configuration includes the following:
◆ Four directors
◆ Two engines
◆ Redundant engine SPSs
◆ 16 front-end Fibre Channel ports (32 for VS1 hardware)
◆ 16 back-end Fibre Channel ports (32 for VS1 hardware)
◆ One management server
◆ Redundant Fibre Channel COM switches for local COM; UPS for
each Fibre Channel switch
Figure 10 shows an example of a dual engine (medium) configuration.
Figure 10 VPLEX dual engine configuration (Engine 1, Engine 2, SPS 1, SPS 2, management server, UPS A, UPS B) [VPLX-000254]
Quad configurations
The VPLEX quad engine configuration includes the following:
◆ Eight directors
◆ Four engines
◆ Redundant engine SPSs
◆ 32 front-end Fibre Channel ports (64 for VS1 hardware)
Figure 11 VPLEX quad engine configuration (Engine 1 through Engine 4, SPS 1 through SPS 4, management server, UPS A, UPS B) [VPLX-000253]
I/O implementation
The VPLEX cluster utilizes a write-through mode when configured
for either VPLEX Local or Metro, whereby all writes are written
through the cache to the back-end storage. To maintain data integrity,
a host write is acknowledged only after the back-end arrays (in one
cluster in the case of VPLEX Local, and in two clusters in the case of
VPLEX Metro) acknowledge the write.
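A minimal sketch of this behavior, assuming a simple mirror-leg model (hypothetical names, not the actual VPLEX code):

def write_through(page, legs):
    """Acknowledge the host only after every back-end mirror leg has
    acknowledged the write. `legs` is one callable per back-end array
    (one cluster for VPLEX Local, two clusters for VPLEX Metro)."""
    for write_to_leg in legs:
        write_to_leg(page)          # write through the cache to back-end storage
    return "ACK to host"            # returned only once all legs have acknowledged

local_array, remote_array = [], []
print(write_through(b"block-0", [local_array.append, remote_array.append]))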
This section describes the VPLEX cluster caching layers, roles, and
interactions. It gives an overview of how reads and writes are
handled within the VPLEX cluster and how distributed cache
coherency works. This is important to the introduction of high
availability concepts.
Cache coherence
Cache coherence creates a consistent global view of a volume.
Distributed cache coherence is maintained using a directory. There is
one directory per virtual volume and each directory is split into
chunks (4096 directory entries within each). These chunks exist only
if they are populated. There is one directory entry per global cache
page, with responsibility for the following (see the sketch after this list):
◆ Tracking page owner(s) and remembering the last writer
◆ Locking and queuing
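A minimal sketch of this bookkeeping, using hypothetical Python structures rather than VPLEX internals, with one directory per virtual volume and chunks created only when a page within them is touched:

CHUNK_SIZE = 4096   # directory entries per chunk, as described above

class Directory:
    """One directory per virtual volume; chunks exist only once populated."""
    def __init__(self):
        self.chunks = {}   # chunk index -> {page number -> entry}

    def entry(self, page):
        chunk = self.chunks.setdefault(page // CHUNK_SIZE, {})
        return chunk.setdefault(page, {"owners": set(), "last_writer": None})

    def record_write(self, page, director):
        e = self.entry(page)
        e["owners"] = {director}       # the writer now owns the page
        e["last_writer"] = director    # remember the last writer

    def record_read(self, page, director):
        self.entry(page)["owners"].add(director)

d = Directory()
d.record_write(10, "director-1-1-A")
d.record_read(10, "director-2-1-B")
print(d.entry(10))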
Meta-directory
Directory chunks are managed by the meta-directory, which assigns
and remembers chunk ownership. These chunks can migrate using
Locality-Conscious Directory Migration (LCDM). This
meta-directory knowledge is cached across the share group (i.e., a
group of multiple directors within the cluster that are exporting a
given virtual volume) for efficiency.
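Continuing the sketch above (again hypothetical, not VPLEX internals), the meta-directory can be thought of as a map from chunk to owning director whose entries can migrate:

class MetaDirectory:
    """Tracks which director owns each directory chunk; ownership can migrate,
    loosely mirroring Locality-Conscious Directory Migration (LCDM)."""
    def __init__(self):
        self.chunk_owner = {}

    def owner(self, chunk, default_director):
        # Assign ownership on first reference, then remember it.
        return self.chunk_owner.setdefault(chunk, default_director)

    def migrate(self, chunk, new_director):
        self.chunk_owner[chunk] = new_director   # move ownership closer to the I/O

md = MetaDirectory()
print(md.owner(0, "director-1-1-A"))   # director-1-1-A
md.migrate(0, "director-2-1-A")
print(md.owner(0, "director-1-1-A"))   # director-2-1-A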
If the data is not found in local cache, VPLEX searches global cache.
Global cache includes all directors that are connected to one another
within the single VPLEX cluster for VPLEX Local, and all of the
VPLEX clusters for both VPLEX Metro and VPLEX Geo. If there is a
global read hit in the local cluster (i.e. same cluster, but different
director) then the read will be serviced from global cache in the same
cluster. The read could also be serviced by the remote global cache if
the consistency group setting “local read override” is set to false (the
default is true). Whenever the read is serviced from global cache
(same cluster or remote), a copy is also stored in the local cache of the
director from where the request originated.
If a read cannot be serviced from either local cache or global cache, it
is read directly from the back-end storage. In these cases both the
global and local cache are updated.
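The read-servicing order described above can be summarized with the following sketch; the function and cache structures are hypothetical illustrations only, and local_read_override defaults to true as noted above:

def service_read(page, local_cache, local_global_cache, remote_global_cache,
                 backend, local_read_override=True):
    """Return data for `page`, searching caches in the order described above."""
    if page in local_cache:                        # local cache hit
        return local_cache[page]
    if page in local_global_cache:                 # global hit in the same cluster
        data = local_global_cache[page]
    elif not local_read_override and page in remote_global_cache:
        data = remote_global_cache[page]           # remote global cache (override disabled)
    else:
        data = backend[page]                       # miss: read from back-end storage
        local_global_cache[page] = data            # update the global cache
    local_cache[page] = data                       # always keep a local copy
    return data

print(service_read(7, {}, {}, {}, {7: b"page-7"}))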
Overview
VPLEX clusters are capable of surviving any single hardware failure
in any subsystem within the overall storage cluster, including the
host connectivity subsystem, memory subsystem, and so on. A single failure
in any subsystem will not affect the availability or integrity of the
data. Multiple failures in a single subsystem and certain
combinations of single failures in multiple subsystems may affect the
availability or integrity of data.
High availability requires that host connections be redundant and
that hosts are supplied with multipath drivers. In the event of a
front-end port failure or a director failure, hosts without redundant
physical connectivity to a VPLEX cluster and without multipathing
software installed may be susceptible to data unavailability.
Cluster
A cluster is a collection of one, two, or four engines in a physical
cabinet. A cluster serves I/O for one storage domain and is managed
as one storage cluster.
All hardware resources (CPU cycles, I/O ports, and cache memory)
are pooled:
◆ The front-end ports on all directors provide active/active access
to the virtual volumes exported by the cluster.
◆ For maximum availability, virtual volumes can be presented
through all directors so that all directors but one can fail without
causing data loss or unavailability. To achieve this with version
5.0.1 code and below, directors must be connected to all storage.
Note: Instant failure of all directors bar one in a dual or quad engine
system would result in the last remaining director also failing, since it
would lose quorum. The statement above therefore only holds if directors
fail one at a time.
Serviceability
In addition to the redundancy fail-safe features, the VPLEX cluster
provides event logs and call home capability via EMC Secure Remote
Support (ESRS).
Note: To ensure the explanation of this subject remains at a high level, in the
following section the graphics have been broken down into major objects
(e.g., Site A, Site B, and Link). You can assume that within each site resides a
VPLEX cluster. Therefore, when a site failure is shown it will also cause a full
VPLEX cluster failure within that site. You can also assume that the link
object between sites represents the main inter-cluster data network connected
to each VPLEX cluster in either site. One further assumption is that all
components within a site share the same failure domain; a site failure will
affect all components within this failure domain, including the VPLEX cluster.
Figure 20 shows a total failure at one of the sites (in this case Site A
has failed). In this case the distributed volume would become
degraded since the hardware required at Site A to support this
particular mirror leg is no longer available. For a resolution to this
example, keep the volume active at Site B so the application can
resume there.
Note: VPLEX Metro also supports the rule set “no automatic winner”. If a
consistency group is configured with this setting, then I/O will suspend at
both VPLEX clusters if either the inter-cluster link partitions or an entire
VPLEX cluster fails. Manual intervention can then be used to resume I/O at a
remaining cluster if required. Care should be taken when setting this policy:
although it ensures that both VPLEX clusters remain identical at all times, the
trade-off is that the production environment is halted. This is useful if a
customer wishes to integrate VPLEX failover semantics with failover
behavior driven by the application (suppose the application has its own
witness, etc.). In this case, the application can provide a script that invokes
the resume CLI command on the VPLEX cluster of its choosing.
Figure 25 shows how static preference can be set for each distributed
volume (also known as a DR1 - Distributed RAID1).
This detach rule can either be set within the VPLEX GUI or via
VPLEX CLI.
Each volume can be either set to Cluster 1 detaches, Cluster 2
detaches or no automatic winner.
If the Distributed Raid 1 device (DR1) is set to Cluster 1 detaches,
then in any failure scenario the preferred cluster for that volume
would be declared as Cluster 1, but if the DR1 detach rule is set to
Cluster 2 detaches, then in any failure scenario the preferred cluster
for that volume would be declared as Cluster 2.
Note: When reading these rules, some people prefer to substitute the word
detaches with preferred or wins, which is perfectly acceptable and may make
the rules easier to understand.
Note: A caveat exists here: if the state of the back end (BE) at the preferred
cluster is out of date (due to a prior BE failure, an incomplete rebuild, or
another issue), the preferred cluster will suspend I/O regardless of preference.
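A minimal sketch of how a static detach rule maps to per-cluster I/O state after an inter-cluster partition (hypothetical helper, for illustration only; the back-end caveat above is not modeled):

def io_state_after_partition(detach_rule):
    """Per-cluster I/O state for a distributed volume after an inter-cluster
    partition, based only on the static detach rule (no VPLEX Witness)."""
    winner = {"cluster-1-detaches": "cluster-1",
              "cluster-2-detaches": "cluster-2",
              "no-automatic-winner": None}[detach_rule]
    return {cluster: ("active" if cluster == winner else "suspended")
            for cluster in ("cluster-1", "cluster-2")}

print(io_state_after_partition("cluster-1-detaches"))
# {'cluster-1': 'active', 'cluster-2': 'suspended'}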
The following diagrams show some examples of the rule set in action
for different failures, the first being a site loss at B with a single DR1
set to Cluster 1 detaches.
Figure 26 shows the initial running setup of the configuration. It can
be seen that the volume is set to Cluster 1 detaches.
If there is a problem at Site B, then the DR1 becomes degraded,
as shown in Figure 27.
The next example shows static preference working under link failure
conditions.
Figure 29 shows a configuration with a distributed volume set to
Cluster 1 detaches as per the previous configuration.
If the link were now lost, then the distributed volume would again
become degraded, as shown in Figure 30.
To ensure that split brain does not occur after this type of failure, the
static preference rule is applied and I/O is suspended at Cluster 2 in
this case, as the rule is set to Cluster 1 detaches.
As can be seen, the preferred site has now failed and the preference
rule has been used, but since the rule is “static” and cannot
distinguish between a link failure and a remote site failure, in this
example the remaining site becomes suspended. Therefore, in this
case, manual intervention will be required to bring the volume online
at Site A.
Static preference is a very powerful rule. It provides zero-RPO and
zero-RTO resolution for non-preferred cluster failure and
inter-cluster partition scenarios, and it completely avoids split brain.
However, in the presence of a preferred cluster failure it provides a
non-zero RTO. It is worth noting that this feature is available without
automation and is a valuable alternative when a VPLEX Witness
configuration (discussed in the next chapter) is unavailable or the
customer infrastructure cannot accommodate one due to the lack of a
third failure domain.
Note: If using a VPLEX Metro deployment without VPLEX Witness, and the
preferred cluster has been lost, I/O can be manually resumed via the CLI at
the remaining (non-preferred) VPLEX cluster. However, care should be taken
here to avoid a conflicting detach or split-brain scenario. (VPLEX Witness
solves this problem automatically.)
Clearly, using the example of the third floor in the building, one
would not be protected from a disaster affecting the entire building
so, depending on the requirement, careful consideration should be
given if choosing this third failure domain.
Figure 36 shows the supported versions (at the time of writing) for
VPLEX Witness.
Check the latest VPLEX ESSM (EMC Simple Support Matrix), located
at https://elabnavigator.emc.com, Simple Support Matrix tab, for
the latest information including VPLEX Witness server physical host
requirements and site qualification.
In this case the VPLEX Witness guides both clusters to follow the
pre-configured static preference rules and volume access at cluster 1
will be suspended since the rule set was configured as cluster 2
detaches.
The next example shows how VPLEX Witness can assist if you have a
site failure at the preferred site. As discussed above, this type of
failure without VPLEX Witness would cause the volumes in the
surviving site to go offline. This is where VPLEX Witness greatly
improves the outcome of this event and removes the need for manual
intervention.
As discussed in the previous section, when a site has failed, the
distributed volumes become degraded. However, unlike the
previous example, where there was a site failure at the preferred site
and the static preference rule was used, forcing volumes into a
suspend state at cluster 1, VPLEX Witness will now observe that
communication is still possible to cluster 1 (but not cluster 2).
Additionally, since cluster 1 cannot contact cluster 2, VPLEX Witness
can make an informed decision and guide cluster 1 to override the
static rule set and proceed with I/O.
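Before looking at the VPlexcli output below, the decisions described in this and the previous section can be summarized with the following sketch (a hypothetical function for illustration, not the actual VPLEX Witness logic):

def witness_guidance(witness_sees_c1, witness_sees_c2, clusters_see_each_other,
                     static_preference):
    """Return the cluster(s) that should continue I/O for a distributed volume."""
    if clusters_see_each_other:
        return {"cluster-1", "cluster-2"}        # normal operation
    if witness_sees_c1 and witness_sees_c2:
        return {static_preference}               # link partition: follow the detach rule
    if witness_sees_c1:
        return {"cluster-1"}                     # cluster 2 has failed or is isolated
    if witness_sees_c2:
        return {"cluster-2"}                     # cluster 1 has failed or is isolated
    return set()                                 # an isolated cluster suspends I/O

# Preferred-site (cluster 2) failure: the Witness overrides a "cluster 2 detaches" rule.
print(witness_guidance(True, False, False, "cluster-2"))   # {'cluster-1'}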
VPlexcli:/cluster-witness> ls
Attributes:
Name Value
------------- -------------
admin-state enabled
private-ip-address 128.221.254.3
public-ip-address 10.31.25.45
Contexts:
components
VPlexcli:/cluster-witness> ll components/
/cluster-witness/components:
VPlexcli:/cluster-witness> ll components/*
/cluster-witness/components/cluster-1:
Name Value
----------------------- ------------------------------------------------------
admin-state enabled
diagnostic INFO: Current state of cluster-1 is in-contact (last
state change: 0 days, 13056 secs ago; last message
from server: 0 days, 0 secs ago.)
id 1
management-connectivity ok
operational-state in-contact
/cluster-witness/components/cluster-2:
Name Value
----------------------- ------------------------------------------------------
admin-state enabled
/cluster-witness/components/server:
Name Value
----------------------- ------------------------------------------------------
admin-state enabled
diagnostic INFO: Current state is clusters-in-contact (last state
change: 0 days, 13056 secs ago.) (last time of
communication with cluster-2: 0 days, 0 secs ago.)
(last time of communication with cluster-1: 0 days, 0
secs ago.)
id -
management-connectivity ok
operational-state clusters-in-contact
Refer to the VPLEX CLI Guide found on Powerlink for more details
on the VPLEX Witness CLI.
VPLEX Witness cluster isolation semantics and dual failures
As discussed in the previous section, deploying a VPLEX solution
with VPLEX Witness will give continuous availability to the storage
volumes regardless of there being a site failure or inter-cluster link
failure. These types of failure are deemed single component failures,
and it has been shown that no single point of failure can induce data
unavailability when using the VPLEX Witness.
It should be noted, however, that in rare situations more than one
fault or component outage can occur, especially when considering the
inter-cluster communication links, where two simultaneous failures
would lead to VPLEX cluster isolation at a given site.
For instance, if you consider a typical VPLEX Setup with VPLEX
Witness you will automatically have three failure domains (this
example will use A, B, and C, where VPLEX cluster 1 resides at A,
VPLEX cluster 2 at B, and the VPLEX Witness server resides at C). In
this case there will be an inter-cluster link between A and B (cluster 1
and 2), plus a management IP link between A and C, as well as a
management IP link between B and C, effectively giving a
triangulated topology.
In rare situations there is a chance that if the link between A and B
failed followed by a further link failure from either A to C or B to C,
then one of the sites will be isolated (cut off).
Due to the nature of VPLEX Witness, these types of isolation can also
be dealt with effectively without manual intervention.
This is achieved because a site isolation is very similar, in terms of
technical behavior, to a full site outage; the main difference is that
the isolated site is still fully operational and powered up (but needs
to be forced into I/O suspension), unlike a site failure where the failed
site is not operational.
In these cases the failure semantics and VPLEX Witness are
effectively the same. However, two further actions are taken at the
site that becomes isolated:
◆ I/O is shut off/suspended at the isolated site.
◆ The VPLEX cluster will attempt to call home.
Figure 44 shows the three scenarios that are described above:
Note: If best practices are followed, the likelihood of these scenarios
occurring is significantly less than even the rare isolation incidents discussed
above, mainly because the faults would have to disrupt components in totally
different fault domains that could be spread over many miles.
Figure 45 Highly unlikely dual failure scenarios that require manual intervention
A point to note in the above scenarios is that, for the shown outcomes
to be correct, the failures would have to have happened in a specific
order, where the link to the VPLEX Witness (or the Witness itself) has
failed first and then either the inter-cluster link or a VPLEX cluster fails.
However, if the order of failure is reversed, then in all three cases the
outcome would be different, since one of the VPLEX clusters would
have remained online for the given distributed volume, therefore not
requiring manual intervention.
This is due to the fact that once a failure occurs, the VPLEX Witness
will give guidance to the VPLEX cluster. This guidance is “sticky”:
once the Witness has provided its guidance, it is no longer consulted
during any subsequent failure until the system has been returned to a
fully operational state (i.e., the system has fully recovered and
connectivity between both clusters and the VPLEX Witness is fully
restored).
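A minimal sketch of this “sticky” behavior, using a hypothetical latch purely for illustration:

class StickyGuidance:
    """Once guidance is handed out after a failure, it is not revisited until
    the system has fully recovered."""
    def __init__(self):
        self.current = None

    def on_failure(self, guidance):
        if self.current is None:       # only the first failure is acted upon
            self.current = guidance
        return self.current

    def on_full_recovery(self):
        self.current = None            # guidance is re-armed only after full recovery

latch = StickyGuidance()
print(latch.on_failure("cluster-1 continues I/O"))   # first failure sets the guidance
print(latch.on_failure("cluster-2 continues I/O"))   # a later failure does not change it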
Figure 46 Two further dual failure scenarios that would require manual
intervention
Note: It is always considered best practice to ensure ESRS and alerting are
fully configured when using VPLEX Witness. This way, if a VPLEX cluster
loses communication with the Witness server, the VPLEX cluster will dial
home and alert. This also ensures that, if both VPLEX clusters lose
communication with the Witness, the Witness function can be manually
disabled when the Witness communication loss or outage is expected to last
for an extended time, reducing the risk of data unavailability in the event of
an additional VPLEX cluster failure or WAN partition.
VPLEX Metro HA
Note: At the time of writing, the only qualified host cluster solutions
that can be configured with the additional VPLEX Metro HA campus (as
opposed to standard VPLEX Metro HA without the cross-cluster
connect) are vSphere versions 4.1 and 5.0, Windows 2008, and IBM
PowerHA 5.4. Be sure to check the latest VPLEX Simple Support Matrix,
located at https://elabnavigator.emc.com, Simple Support Matrix tab, for the
latest support information, or submit an RPQ.
Failure scenarios
For the following failure scenarios, this section assumes that vSphere
5.0 update 1 or above is configured in a stretched HA topology with
DRS so that all of the physical hosts (ESX servers) are within the same
HA cluster. As discussed previously, this type of configuration brings
the ability to teleport virtual machines over distance, which is
extremely useful in disaster avoidance, load balancing and cloud
infrastructure use cases. These use cases are all enabled using out of
the box features and functions; however, additional value can be
derived from deploying the VPLEX Metro HA campus solution to
ensure total availability for both planned and unplanned events.
High-level recommendations and pre-requisites for stretching
vSphere HA when used in conjunction with VPLEX Metro are as
follows:
◆ A single vCenter instance must span both locations that contain
the VPLEX Metro cluster pairs. (Note: it is recommended this is
virtualized and protected via vSphere heartbeat to ensure restart
in the event of failure)
◆ Must be used in conjunction with a stretched layer 2 network
ensuring that once a VM is moved it still resides on the same
logical network.
◆ vSphere HA can be enabled within the vSphere cluster, but
vSphere fault tolerance is not supported (at the time of writing
but is planned for late 2012).
◆ Can be used with vSphere DRS. However, careful consideration
should be given to this feature for vSphere versions prior to 5.0
update 1, since certain failure conditions where the VM is running
at the “non-preferred site” may not invoke a VM failover after
failure, due to a problem where the ESX server does not detect a
storage “Persistent Device Loss (PDL)” state. This can lead to the
VM remaining online but intermittently unresponsive (also known
as a zombie VM). Manual intervention would be required in this
scenario.
Example 1 Figure 49 shows all physical ESX hosts failing in domain A1.
Since all of the physical hosts in domain B1 are connected to the same
datastores via the VPLEX Metro distributed device, VMware HA can
restart the virtual machines on any of the physical ESX hosts in
domain B1.
Example 2 The next example describes what will happen in the unlikely event
that a VPLEX cluster were to fail in either domain A2 or B2.
Since the ESX servers are cross connected to both VPLEX clusters in
each site, ESX will simply re-route the I/O to the alternate path,
which is still available since VPLEX is configured with a VPLEX
Witness protected distributed volume. This ensures the distributed
volume remains online in domain B2: the VPLEX Witness Server
observes that it cannot communicate with the VPLEX cluster in A2
and guides the VPLEX cluster in B2 to remain online, as B2 also
cannot communicate with A2, meaning A2 is either isolated or has
failed.
Note: Similarly in the event of a full isolation at A2, the distributed volumes
would simply suspend at A2 since communication would not be possible to
either the VPLEX Witness Server or the VPLEX cluster in domain B2. In this
case, the outcome is identical from a vSphere perspective and there will be no
interruption since I/O would be re-directed across the cross connect to
domain B2 where the distributed volume would remain online and available
to service I/O.
Example 3 The following example describes what will happen in the event of a
failure of one (or all) of the back-end storage arrays in either domain
A3 or B3. Again, in this instance there would be no interruption to
any of the virtual machines.
Figure 51 shows the failure of all storage arrays that reside in domain
A3. Since a cache-coherent VPLEX Metro distributed volume is
configured between domains A2 and B2, I/O can continue to be
actively serviced from the VPLEX in A2 even though the local back-end
storage has failed. This is due to the embedded VPLEX cache
coherency, which will efficiently cache any reads into the A2 domain
whilst also propagating writes to the back-end storage in domain B3
via the remote VPLEX cluster in site B2.
Example 4 The next example describes what will happen in the event of a
VPLEX Witness server failure in domain C1.
Again, in this instance there would be no interruption to any of the
virtual machines or VPLEX clusters. Figure 52 shows a complete
failure of domain C1, where the VPLEX Witness server resides. Since
the VPLEX Witness is not within the I/O path and is only an optional
component, I/O will actively continue for any distributed volume in
domains A2 and B2, since the inter-cluster link is still available,
meaning cache coherency can be maintained between the VPLEX
cluster domains.
Although the service is uninterrupted, both VPLEX clusters will now
dial home and indicate they have lost communication with the
VPLEX Witness Server, as a further failure of either of the VPLEX
clusters in domains A2 and B2, or of the inter-cluster link, would cause
data unavailability. The risk of this is heightened should the VPLEX
Witness server be offline for an extended duration. To remove this
risk, the Witness feature may be disabled manually, therefore enabling
the VPLEX clusters to follow the static preference rules.
Example 5 The next example describes what will happen in the event of a failure
to the inter-cluster link between domains A2 and B2.
Again, in this instance there would be no interruption to any of the
virtual machines or VPLEX clusters.
Figure 53 shows the inter-cluster link has failed between domains A2
and B2. In this instance the static preference rule set, which was
defined previously, will be invoked since neither VPLEX cluster can
communicate with the other (but the VPLEX Witness Server can
communicate with both VPLEX clusters). Therefore, access to the given
distributed volume within one of the domains A2 or B2 will be
suspended. Since in this example the cross-connect network is
physically separate from the inter-cluster link, the alternate paths are
still available to the remote VPLEX cluster where the volume remains
online; therefore, ESX will simply re-route the traffic to the alternate
VPLEX cluster, meaning the virtual machine will remain online and
unaffected whichever site it was running on.
Note: At the time of writing, a larger number of host-based cluster solutions
are supported with VPLEX Metro HA when compared to VPLEX Metro HA
campus (with the cross-connect). Although this next section only discusses
vSphere with VPLEX Metro HA, other supported host clusters include
Microsoft Hyper-V, Oracle RAC, PowerHA, Serviceguard, etc. Be sure to
check the latest VPLEX Simple Support Matrix found at
https://elabnavigator.emc.com, Simple Support Matrix tab, for the latest
support information.
Failure scenarios
As with the previous section, when deploying a stretched vSphere
configuration with VPLEX Metro HA, it is also possible to enable
long-distance vMotion (virtual machine teleportation), since the ESX
datastore resides on a VPLEX Metro distributed volume and therefore
exists in two places at the same time, exactly as described in the
previous section.
Again, for these failure scenarios, this section assumes that vSphere
version 5.0 update 1 or higher is configured in a stretched HA
topology so that all of the physical hosts at either site (ESX servers)
are within the same HA cluster.
Since the majority of the failure scenarios behave identically to the
cross-connect configuration, this section will only show two failure
scenarios where the outcomes differ slightly to the previous section.
Note: For detailed setup instructions and best practice planning for a
stretched HA vSphere environment please read White Paper: Using VMware
vSphere with EMC VPLEX — Best Practices Planning, which can be found on
Powerlink (http://powerlink.emc.com) under Home > Support > Technical
Documentation and Advisories > Hardware/Platforms Documentation >
VPLEX Family > White Papers.
Example 1 The following example describes what will happen in the unlikely
event that a VPLEX cluster were to fail in domain A2. In this instance
there would be no interruption of service to any virtual machines
running in domain B1; however, any virtual machines that were
running in domain A1 would see a minor interruption as the virtual
machines are restarted at B1.
Figure 56 shows a full VPLEX cluster outage in domain A2.
As can be seen in the graphic, since the ESX servers are not
cross-connected to the remote VPLEX cluster, the ESX servers will lose
access to the storage, causing the HA host cluster (in this case
vSphere) to perform an HA restart for the virtual machines within
domain A2. It can do this since the distributed volumes will remain
active at B2: VPLEX is configured with a VPLEX Witness protected
distributed volume, and the Witness will deduce that the VPLEX in
domain A2 is unavailable (since neither the VPLEX Witness Server nor
the VPLEX cluster in B2 can communicate with the VPLEX cluster in
A2), therefore guiding the VPLEX cluster in B2 to remain online.
Example 2 The next example describes what will happen in the event of a failure
to the inter-cluster link between domains A2 and B2.
One of two outcomes will occur, as summarized in the sketch after this list:
◆ VMs running at the preferred site - If the static preference for a
given distributed volume was set to cluster 1 detaches (assuming
cluster 1 resides in domain A2) and the virtual machine was
running at the same site where the volume remains online (aka
the preferred site) then there is no interruption to service.
◆ VMs running at the non-preferred site - If the static preference
for a given distributed volume was set to cluster 1 detaches
(assuming cluster 1 resides in domain A2) and the virtual
machine was running at the remote site (Domain B1) then the
VM’s storage will be in the suspended state (PDL). In this case
the guest operating systems will fail allowing the virtual machine
to be automatically restarted in domain A1.
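A minimal sketch summarizing these two outcomes (hypothetical helper, for illustration only):

def vm_outcome(vm_site, preferred_site):
    """Outcome for a VM after an inter-cluster link failure in a VPLEX Metro HA
    deployment without cross-connect."""
    if vm_site == preferred_site:
        return "no interruption to service"
    return "storage suspended (PDL); VM restarted by HA at the preferred site"

print(vm_outcome("A1", preferred_site="A1"))   # VM at the preferred site
print(vm_outcome("B1", preferred_site="A1"))   # VM at the non-preferred site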
Figure 57 shows the link has failed between domains A2 and B2.
In this instance the static preference rule set, which was previously
defined as cluster 1 detaches, will be invoked since neither VPLEX
cluster can communicate with the other (but the VPLEX Witness
Server can communicate with both VPLEX clusters). Therefore, access
to the given distributed volume within domain B2 will be suspended
whilst it remains active at A2.
Virtual machines that were running at A1 will be uninterrupted, and
virtual machines that were running at B1 will be restarted at A1.
Note: Similar to the previous note, and though it is beyond the scope of this
TechBook, vSphere HA versions prior to 5.0 update 1 may not detect this
condition and may not restart the VM if it was running at the non-preferred
site. Therefore, to avoid any disruption when using vSphere in this type of
configuration (for versions prior to 5.0 update 1), VMware DRS host affinity
rules can be used (where supported) to ensure that virtual machines are
always running in their preferred location (i.e., the location that the storage
they rely on is biased towards). Another way to avoid this scenario is to
disable DRS altogether and use vSphere HA only, or to use a cross-connect
configuration deployed across a separate physical network as discussed in
the previous section. See Appendix A, “vSphere 5.0 Update 1 Additional
Settings,” for the vSphere HA configuration settings that are applicable for
ESX implementations post version 5.0 update 1 to avoid this problem.
The remaining failure scenarios with this solution are identical to the
previously discussed VPLEX Metro HA campus solutions. For failure
handling in domains A1, B1, A3, B3, or C, see “VPLEX Metro HA
Campus (with cross-connect)” on page 101.
Conclusion
Conclusion
As outlined in this book, using VPLEX AccessAnywhere™
technology in combination with high availability and VPLEX
Witness, storage administrators and data center managers will be
able to provide absolute physical and logical high availability for
their organizations’ mission-critical applications with less resource
overhead and dependency on manual intervention. Increasingly,
those mission-critical applications are virtualized, in most cases
using VMware vSphere or Microsoft Hyper-V “virtual machine”
technologies. It is expected that VPLEX customers will use the HA /
VPLEX Witness solution to incorporate several application-specific
clustering and virtualization technologies to provide HA benefits for
targeted mission-critical applications.
As described, the storage administrator is provided with two specific
VPLEX Metro-based high availability solutions, as outlined
specifically for VMware ESX 4.1 or higher: VPLEX Metro HA Campus
(with cross-cluster connect) and standard (non-campus) VPLEX
Metro HA. VPLEX Metro HA Campus provides a higher level of HA
than the VPLEX Metro HA deployment without cross-cluster
connectivity. However, it is limited to in-data-center use or cases
where the network latency between data centers is negligible.
Both solutions are ideal for customers who are already highly
virtualized, or are planning to become so, and who are looking for the
following:
◆ Elimination of the “night shift” storage and server administrator
positions. To accomplish this, they must be comfortable that their
applications will ride through any failures that happen during
the night.
◆ Reduction of capital expenditures by moving from an
active/passive data center replication model to a fully active
highly available data center model.
◆ Increase application availability by protecting against flood and
fire disasters that could affect their entire data center.
A
AccessAnywhere The breakthrough technology that enables VPLEX clusters to provide
access to information between clusters that are separated by distance.
active/active A cluster with no primary or standby servers, because all servers can
run applications and interchangeably act as backup for one another.
array A collection of disk drives where user data and parity data may be
stored. Devices can consist of some or all of the drives within an
array.
asynchronous Describes objects or events that are not coordinated in time. A process
operates independently of other processes, being initiated and left for
another task before being acknowledged.
For example, a host writes data to the blades and then begins other
work while the data is transferred to a local disk and across the WAN
asynchronously. See also ”synchronous.”
B
bandwidth The range of transmission frequencies a network can accommodate,
expressed as the difference between the highest and lowest
frequencies of a transmission cycle. High bandwidth allows fast or
high-volume transmissions.
bias When a cluster has the bias for a given DR1, it will remain online if
connectivity is lost to the remote cluster (in some cases this may get
overruled by VPLEX Cluster Witness). This is now known as
preference.
block The smallest amount of data that can be transferred following SCSI
standards, which is traditionally 512 bytes. Virtual volumes are
presented to users as contiguous lists of blocks.
C
cache Temporary storage for recent writes and recently accessed data. Disk
data is read through the cache so that subsequent read references are
found in the cache.
cache coherency Managing the cache so data is not lost, corrupted, or overwritten.
With multiple processors, data blocks may have several copies, one in
the main memory and one in each of the cache memories. Cache
coherency propagates the blocks of multiple users throughout the
system in a timely fashion, ensuring the data blocks do not have
inconsistent versions in the different processors’ caches.
controller A device that controls the transfer of data to and from a computer and
a peripheral device.
D
data sharing The ability to share access to the same data with multiple servers
regardless of time and location.
detach rule A rule set applied to a DR1 to declare a winning and a losing cluster
in the event of a failure.
director A CPU module that runs GeoSynchrony, the core VPLEX software.
There are two directors in each engine, and each has dedicated
resources and is capable of functioning independently.
dirty data The write-specific data stored in the cache memory that has yet to be
written to disk.
disaster recovery (DR) The ability to restart system operations after an error, preventing data
loss.
disk cache A section of RAM that provides a cache between the disk and the CPU.
RAM’s access time is significantly faster than disk access time;
therefore, a disk-caching program enables the computer to operate
faster by placing recently accessed data in the disk cache.
distributed file system (DFS) Supports the sharing of files and resources in the form of persistent
storage over a network.
Distributed RAID 1 device (DR1) A cache-coherent VPLEX Metro or Geo volume that is distributed
between two VPLEX clusters.
E
engine Enclosure that contains two directors, management modules, and
redundant power.
Ethernet A Local Area Network (LAN) protocol. Ethernet uses a bus topology,
meaning all devices are connected to a central cable, and supports
data transfer rates of between 10 megabits per second and 10 gigabits
per second. For example, 100 Base-T supports data transfer rates of
100 Mb/s.
event A log message that results from a significant action initiated by a user
or the system.
F
failover Automatically switching to a redundant or standby device, system,
or data path upon the failure or abnormal termination of the
currently active device, system, or data path.
Fibre Channel (FC) A protocol for transmitting data between computer devices. Longer
distance requires the use of optical fiber; however, FC also works
using coaxial cable and ordinary telephone twisted pair media. Fibre
channel offers point-to-point, switched, and loop interfaces. Used
within a SAN to carry SCSI traffic.
field replaceable unit (FRU) A unit or component of a system that can be replaced on site, as
opposed to returning the system to the manufacturer for repair.
firmware Software that is loaded on and runs from the flash ROM on the
VPLEX directors.
G
Geographically distributed system A system physically distributed across two or more geographically
separated sites. The degree of distribution can vary widely, from
different locations on a campus or in a city to different continents.
gigabit Ethernet The version of Ethernet that supports data transfer rates of 1 Gigabit
per second.
I
input/output (I/O) Any operation, program, or device that transfers data to or from a
computer.
internet Fibre Channel protocol (iFCP) Connects Fibre Channel storage devices to SANs or the Internet in
geographically distributed systems using TCP.
intranet A network operating like the World Wide Web but with access
restricted to a limited group of authorized users.
K
kilobit (Kb) 1,024 (2^10) bits. Often rounded to 10^3.
L
latency Amount of time it requires to fulfill an I/O request.
local area network (LAN) A group of computers and associated devices that share a common
communications line and typically share the resources of a single
processor or server within a small geographic area.
logical unit number (LUN) Used to identify SCSI devices, such as external hard drives,
connected to a computer. Each device is assigned a LUN number
which serves as the device’s unique address.
M
megabit (Mb) 1,048,576 (2^20) bits. Often rounded to 10^6.
metadata Data about data, such as data quality, content, and condition.
metavolume A storage volume used by the system that contains the metadata for
all the virtual volumes managed by the system. There is one metadata
storage volume per cluster.
mirroring The writing of data to two or more disks simultaneously. If one of the
disk drives fails, the system can instantly switch to one of the other
disks without losing data or service. RAID 1 provides mirroring.
miss An operation where the cache is searched but does not contain the
data, so the data instead must be accessed from disk.
N
namespace A set of names recognized by a file system in which all names are
unique.
network partition When one site loses contact or communication with another site.
P
parity The even or odd number of 0s and 1s in binary code.
parity checking Checking for errors in binary data. Depending on whether the byte
has an even or odd number of bits, an extra 0 or 1 bit, called a parity
bit, is added to each byte in a transmission. The sender and receiver
agree on odd parity, even parity, or no parity. If they agree on even
parity, a parity bit is added that makes each byte even. If they agree
on odd parity, a parity bit is added that makes each byte odd. If the
data is transmitted incorrectly, the change in parity will reveal the
error.
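As a small illustration of the even-parity case described above (a standard computation, shown here in Python for clarity):

def even_parity_bit(byte):
    """Return the parity bit that makes the total number of 1 bits even."""
    ones = bin(byte).count("1")
    return ones % 2          # 1 is appended when the byte has an odd number of 1 bits

print(even_parity_bit(0b10110010))   # 4 one-bits -> parity bit 0
print(even_parity_bit(0b10110011))   # 5 one-bits -> parity bit 1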
preference When a cluster has the preference for a given DR1, it will remain online if
connectivity is lost to the remote cluster (in some cases this may get
overruled by VPLEX Cluster Witness). This was previously known as bias.
R
RAID The use of two or more storage volumes to provide better
performance, error recovery, and fault tolerance.
RAID 1 Also called mirroring, this has been used longer than any other form
of RAID. It remains popular because of simplicity and a high level of
data availability. A mirrored array consists of two or more disks. Each
disk in a mirrored array holds an identical image of the user data.
RAID 1 has no striping. Read performance is improved since either
disk can be read at the same time. Write performance is lower than
single disk storage. Writes must be performed on all disks, or mirrors,
in the RAID 1. RAID 1 provides very good data reliability for
read-intensive applications.
RAID leg A copy of data, called a mirror, that is located at a user's current
location.
remote direct memory access (RDMA) Allows computers within a network to exchange data using their
main memories and without using the processor, cache, or operating
system of either computer.
Recovery Point Objective (RPO) The amount of data that can be lost before a given failure event.
Recovery Time Objective (RTO) The amount of time the service takes to fully recover after a failure
event.
S
scalability Ability to easily change a system in size or configuration to suit
changing conditions, to grow with your needs.
small computer system interface (SCSI) A set of evolving ANSI standard electronic interfaces that allow
personal computers to communicate faster and more flexibly than
previous interfaces with peripheral hardware such as disk drives,
tape drives, CD-ROM drives, printers, and scanners.
split brain Condition when a partitioned DR1 accepts writes from both clusters.
This is also known as a conflicting detach.
storage RTO The amount of time taken for the storage to be available after a failure
event (in all cases this will be a smaller time interval than the RTO,
since the storage is a prerequisite).
stripe depth The number of blocks of data stored contiguously on each storage
volume in a RAID 0 device.
striping A technique for spreading data over multiple disk drives. Disk
striping can speed up operations that retrieve data from disk storage.
Data is divided into units and distributed across the available disks.
RAID 0 provides disk striping.
T
throughput 1. The number of bits, characters, or blocks passing through a data
communication system or portion of that system.
2. The maximum capacity of a communications channel or system.
3. A measure of the amount of work performed by a system over a
period of time. For example, the number of I/Os per day.
tool command language (TCL) A scripting language often used for rapid prototypes and scripted
applications.
transmission control protocol/Internet protocol (TCP/IP) The basic communication language or protocol used for traffic on a
private network and the Internet.
U
uninterruptible power supply (UPS) A power supply that includes a battery to maintain power in the
event of a power failure.
universal unique identifier (UUID) A 64-bit number used to uniquely identify each VPLEX director. This
number is based on the hardware serial number assigned to each
director.
V
virtualization A layer of abstraction implemented in software that servers use to
divide available physical storage into storage volumes or virtual
volumes.
virtual volume A virtual volume looks like a contiguous volume, but can be
distributed over two or more storage volumes. Virtual volumes are
presented to hosts.
VPLEX Cluster Witness A new feature in VPLEX V5.x that can augment and improve upon
the failure handling semantics of static preference.
W
wide area network (WAN) A geographically dispersed telecommunications network. This term
distinguishes a broader telecommunication structure from a local
area network (LAN).
world wide name (WWN) A specific Fibre Channel Name Identifier that is unique worldwide
and represented by a 64-bit unsigned binary value.