ABSTRACT
A key function of an enterprise-class storage system is to host data in
a safe and reliable manner. The storage must provide continuous,
uninterrupted access to data, meet stringent performance
requirements, and deliver advanced functionality to streamline
operations and simplify data management.
This white paper defines high availability, data protection and data
integrity, and examines how XtremIO’s unique hardware and software
design achieves the utmost in uptime and resiliency against failures.
The combination of a scale-out design with a service oriented and
modular software architecture allows XtremIO to operate as a unified
system with the ability to adapt independent modules in case of
unexpected hardware failures. This paper details the monitoring, the
redundancy levels, the integrity checks and the extreme flexibility in
the architecture to maintain system performance and data availability
by adjusting to failures.
XtremIO High Availability and Data Protection Architecture
Copyright © 2014 EMC Corporation. All Rights Reserved.
EMC believes the information in this publication is accurate as of its publication date. The information is subject to change
without notice.
The information in this publication is provided “as is.” EMC Corporation makes no representations or warranties of any kind with
respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a
particular purpose.
Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.
EMC2, EMC, the EMC logo, and the RSA logo are registered trademarks or trademarks of EMC Corporation in the United States
and other countries. VMware is a registered trademark of VMware, Inc. in the United States and/or other jurisdictions. All other
trademarks used herein are the property of their respective owners. © Copyright 2014 EMC Corporation. All rights reserved.
Published in the USA. 02/14 White Paper H12914
ABSTRACT
INTRODUCTION
END-TO-END VERIFICATION
Hardware verification
Cryptographic data fingerprint
Separate message path for data and cryptographic fingerprint
CONCLUSION
CONTACT US
Even specialized storage systems are built on software and general-purpose computing components that can all fail. Some
failures may be immediately visible, such as a disk or SSD failure. Others can be subtle, such as a shortage of memory
resources that results in performance issues. To ensure high availability and data integrity in the face of such failures, the best
storage systems have an architecture that maintains I/O flow as long as data protection is not at risk, and include various data
integrity checks that are generally optimized for system performance.
This white paper defines high availability, data protection and data integrity, and examines how XtremIO’s unique hardware and
software design achieves the utmost in uptime and resiliency against failures. The combination of a scale-out design with a
service-oriented, modular software architecture allows multiple XtremIO X-Brick nodes to operate as a single, unified system
with the ability to adapt independent modules in case of unexpected hardware failures.
This paper details the monitoring, the redundancy levels, the integrity checks and the extreme flexibility in the architecture to
maintain system performance and data availability by adjusting to failures.
Audience
This white paper is intended for EMC customers, technical consultants, partners, and members of the EMC and partner
professional services community who are interested in learning more about how XtremIO’s architecture achieves High Availability,
Data Integrity, and Data Protection.
There are two types of redundancy: passive redundancy and active redundancy. Passive redundancy provisions excess
components that are idle and are not operational unless the primary component fails. Two examples of passive redundancy in
enterprise storage are active/passive controller designs (where one controller serves I/O and the second controller does not
serve I/O unless the primary controller fails) and “hot spare” drives, which are designated spare drives in the system waiting to
be used upon failure of another drive. In general, a passive design wastes resources and adds cost by having additional hardware that
is rarely used but is part of the system. An active redundancy design maintains activity on all system components and ideally
balances all of them, getting the highest utilization of resources and the least impact upon any component failure. It is highly
desirable to have an active redundancy system.
While a system should always maintain availability upon any single failure, the best system designs do not lose data even during
dual simultaneous failures.
DATA INTEGRITY
The primary function of an enterprise storage array is to reliably store user data. When a host reads data, the storage system
must provide the correct data stored at the requested location. The accuracy of the data must be validated from the reception of
data by the storage system, through travel within the system, to the data being written to the back-end storage medium (end-
to-end verification). In order to verify that the data is correct upon a read, the system needs to create a fingerprint based upon
the stored data, and check the fingerprint when reading the data. The fingerprint ensures that the data has not changed at rest
or in flight. Ideally, the system should use independent locations for the data and its fingerprint. This reduces the probability of
any single component affecting the data and fingerprint in the same way, which could lead to a false indication that the data is
good while it is not. The worst thing a storage system can do is to provide corrupted data to the host while indicating that the
data is good.
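The read-time check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not XtremIO's implementation: SHA-256 stands in for whatever fingerprint function the array uses, and the dictionaries data_store and fingerprint_store are hypothetical stand-ins for two independent storage locations.

```python
import hashlib

# Hypothetical stand-ins for two independent locations: one holds the
# data blocks, the other holds their fingerprints.
data_store = {}
fingerprint_store = {}


def fingerprint(block: bytes) -> str:
    """Fingerprint a block (SHA-256 is an assumption made for this sketch)."""
    return hashlib.sha256(block).hexdigest()


def write_block(address: int, block: bytes) -> None:
    """Store the block and, separately, the fingerprint computed on entry."""
    fingerprint_store[address] = fingerprint(block)
    data_store[address] = block


def read_block(address: int) -> bytes:
    """Return the block only if it still matches its stored fingerprint."""
    block = data_store[address]
    if fingerprint(block) != fingerprint_store[address]:
        # Never hand corrupted data to the host as if it were good.
        raise IOError(f"integrity check failed at address {address}")
    return block
```

Because the fingerprint is kept apart from the data, a fault that alters the block alone is caught on the next read instead of being returned to the host as valid data.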
Each XtremIO Storage Controller and array enclosure has dual power supplies and each has power provided from two separate
power circuits (per XtremIO installation best practices).
There are six main module types in the system and multiple instances of each can be running in the system independently.
Three module types are infrastructure modules and are responsible for system wide management, availability, and services for
other modules. The other three module types are I/O modules responsible for data services within the array and for host
communication.
Infrastructure Modules
System Wide Management (SYM) Module
The System Wide Management module has a complete view of all the hardware and software components. It is responsible for
system availability and initiates any changes in system configuration to achieve maximum availability and redundancy. It makes
the decisions about which modules will execute on which Storage Controller, initiates failovers of data ownership from one
Storage Controller to another, and initiates rebuilds upon SSD failures. For redundancy purposes, multiple SYM modules are
running at the same time in the system, but at any single point in time only one is the active management entity and is the sole
entity to make system wide decisions. Should the component running the active SYM module fail, another SYM module quickly
becomes active and takes over.
Additional software logic, which runs on every Storage Controller, has the responsibility to verify that there is one, and only one,
SYM active in the system. This simple process eliminates the possibility of not having any SYM module running.
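The "one, and only one" rule can be illustrated with a short Python sketch. The SymInstance class, its fields, and the promotion policy (promote the first healthy standby) are assumptions made for illustration; the paper does not describe how XtremIO selects the next active SYM.

```python
from dataclasses import dataclass


@dataclass
class SymInstance:
    controller: str   # Storage Controller hosting this SYM instance
    healthy: bool
    active: bool = False


def enforce_single_active(instances: list[SymInstance]) -> SymInstance:
    """Leave exactly one healthy SYM instance active and return it."""
    # A failed instance can never remain the active management entity.
    for sym in instances:
        if not sym.healthy:
            sym.active = False

    healthy = [sym for sym in instances if sym.healthy]
    if not healthy:
        raise RuntimeError("no healthy SYM instance available")

    # Keep the current active instance if there is one, otherwise promote
    # the first healthy standby (a simple, deterministic choice).
    current = [sym for sym in healthy if sym.active]
    chosen = current[0] if current else healthy[0]
    for sym in healthy:
        sym.active = sym is chosen
    return chosen
```

Running the same check on every Storage Controller, as the text describes, means the rule keeps being enforced even when the controller that hosted the active SYM is the one that failed.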
I/O Modules
The I/O Modules are responsible for storing data from hosts and retrieving it upon request. Each I/O module type runs on every Storage
Controller (although the XtremIO architecture is flexible enough to run different modules on different controllers, this is not done
today). As mentioned before, the assignment of which module runs on each controller is done by the SYM. Every I/O passes
through all three types of I/O modules (Routing, Control, and Data).
Routing Module
The Routing Module is the only entity in the system that communicates with the Host. It accepts SCSI commands from the Host
and parses them. It is stateless and simply translates the requests into volume and Logical Block Addresses (LBAs). It then
forwards the request to the appropriate Control Module (and Storage Controller) that manages those LBAs. The Routing Module
inherently balances load across the entire XtremIO clustered system. It runs a content-based fingerprinting function that results
in data being evenly distributed across all the X-Bricks in the system. Please refer to the Introduction to the XtremIO All-Flash
Array White Paper for a detailed explanation of this process.
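To see why content-based fingerprinting tends to spread data evenly, consider the following Python sketch. It is an illustration only: SHA-256 and the modulo placement rule are assumptions, and owner_for_block and distribution are hypothetical names, not part of XtremIO.

```python
import hashlib
from collections import Counter


def owner_for_block(block: bytes, num_owners: int) -> int:
    """Pick an owner index from the block's content fingerprint."""
    digest = hashlib.sha256(block).digest()
    # Use the leading 8 bytes of the fingerprint as an integer key.
    return int.from_bytes(digest[:8], "big") % num_owners


def distribution(blocks, num_owners: int) -> Counter:
    """Count how many blocks land on each owner."""
    return Counter(owner_for_block(b, num_owners) for b in blocks)
```

Because a good fingerprint function behaves like a uniform random mapping, the per-owner counts come out close to equal regardless of the host address pattern, and identical content always maps to the same owner.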
Control Module
The Control Module is responsible for translating the Host user address (Logical Block Address) to an XtremIO internal mapping.
It acts as a virtualization layer between the Host SCSI Volume/LBA and XtremIO back-end deduplicated location. Having this
virtualization layer provides the ability to efficiently implement a range of rich data services. Data stored on XtremIO is content
addressable: its location in the array is determined according to its content, and not based on its address as in other storage
products. The LBAs of every volume on an XtremIO array are distributed among many Control Modules.
Data Module
The Data Module is responsible for storing data on the SSDs. It works as a service to the Control Module where the Control
Module provides a content fingerprint and the Data Module will write or read the data according to that fingerprint. There are
only three basic operations the Data Module executes: Read, Write or Erase a block. The goal is to keep the module as simple as
possible to maintain a robust and reliable system design. The Control Module does not need to worry about XtremIO Data
Protection (XDP) allocation. Centralizing the XDP scheme in the Data Module provides flexibility and efficiency in the system.
In the same manner that the Control Module evenly maps Host address to content fingerprint, the Data Module evenly maps
content fingerprint to physical location on SSD. This process guarantees that the data is balanced not only across all Storage
Controllers, but also across all SSDs in the array. This additional translation layer also allows the Data Module to place the data
optimally on the SSDs. Even in challenging scenarios like failed components, minimal free space, and frequent data overwrite,
XDP can find optimal locations to store data in the system. (To learn more about how XDP provides redundancy and flash-
optimized data placement, please see the EMC whitepaper titled “XtremIO Data Protection”.)
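The division of labor between the Control Module and the Data Module can be sketched as two small Python classes. This is a simplified model under stated assumptions: SHA-256 as the fingerprint, in-memory dictionaries in place of SSDs, and reference counting to decide when a block can be erased. None of these details are confirmed by this paper, and XDP placement is not modeled.

```python
import hashlib


class DataModule:
    """Stores blocks by content fingerprint: write, read, erase only."""

    def __init__(self):
        self._blocks = {}      # fingerprint -> block contents
        self._refcount = {}    # fingerprint -> number of references

    def write(self, fp: str, block: bytes) -> None:
        if fp not in self._blocks:
            self._blocks[fp] = block   # first copy of this content
        self._refcount[fp] = self._refcount.get(fp, 0) + 1

    def read(self, fp: str) -> bytes:
        return self._blocks[fp]

    def erase(self, fp: str) -> None:
        self._refcount[fp] -= 1
        if self._refcount[fp] == 0:    # last reference is gone
            del self._blocks[fp], self._refcount[fp]


class ControlModule:
    """Maps (volume, LBA) to a content fingerprint."""

    def __init__(self, data_module: DataModule):
        self._map = {}
        self._data = data_module

    def write(self, volume: str, lba: int, block: bytes) -> None:
        fp = hashlib.sha256(block).hexdigest()
        old = self._map.get((volume, lba))
        self._data.write(fp, block)
        self._map[(volume, lba)] = fp
        if old is not None:
            self._data.erase(old)      # overwritten content loses a reference

    def read(self, volume: str, lba: int) -> bytes:
        return self._data.read(self._map[(volume, lba)])
```

In this model, two host addresses that hold identical content share one stored block, which is the property that makes the back-end location "deduplicated" in the sense used above.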
Restarting Modules
Since all modules run in user space, XIOS can quickly restart modules as needed. Any software failures or questionable behavior
in a module results in an automated module restart. The restarts are non-disruptive and generally undetectable at the user
level. This capability also serves as the foundation for Non-disruptive Upgrades (NDU).
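The restart behavior can be pictured as a simple supervision loop. The following Python sketch is hypothetical: the Module class and supervise function are illustrative names, and a real restart would relaunch a user-space process instead of resetting a flag.

```python
class Module:
    """A hypothetical user-space module with a health flag."""

    def __init__(self, name: str):
        self.name = name
        self.healthy = True
        self.restarts = 0

    def restart(self) -> None:
        # A real restart would launch a fresh process; here we only reset state.
        self.healthy = True
        self.restarts += 1


def supervise(modules: list[Module]) -> list[str]:
    """Restart every unhealthy module and return the names that were restarted."""
    restarted = []
    for module in modules:
        if not module.healthy:
            module.restart()
            restarted.append(module.name)
    return restarted
```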
Infiniband
A copy of all array metadata is stored in the Storage Controller memory. Updated metadata is synchronously replicated over
Infiniband RDMA in a distributed fashion to one or more physical Storage Controllers so that every real-time change is protected
in multiple locations. In a system with more than one X-Brick, the journal data on each node is protected by all other nodes
using a distributed replication process. The system-wide Management module manages the journal replication relationships
between Storage Controllers in the cluster. For resiliency reasons, a Storage Controller that is on backup battery power cannot
be a target for replicated journal data. If, for any reason, the Storage Controller cannot write the journal data to a separate
Storage Controller then it will write it locally to its SSDs as a final fail-safe. If a controller fails, the replicated journal is used to
rebuild the lost contents from the failed controller. All journal contents are periodically de-staged to SSD non-volatile storage.
In the event of a power loss, the system’s battery backup units allow this de-staging to take place and for the system to
complete an orderly shutdown. Certain highly critical metadata is de-staged and stored using triple replication, while other less-
critical metadata (that can be recovered in other ways) is stored using the same XDP scheme as user data.
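The journal placement rules above (replicate to other controllers, never to one running on battery, fall back to local SSD) can be summarized in a short Python sketch. The Controller class, the journal_targets function, and the "local-ssd" label are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Controller:
    name: str
    reachable: bool = True
    on_battery: bool = False


def journal_targets(source: Controller, cluster: list[Controller]) -> list[str]:
    """Return where `source` should place its journal updates."""
    remote = [
        c.name
        for c in cluster
        if c.name != source.name and c.reachable and not c.on_battery
    ]
    # Final fail-safe from the text: with no eligible peer, write locally to SSD.
    return remote if remote else [f"{source.name}:local-ssd"]
```

For example, in a two-controller system whose peer has dropped to battery power, the function returns only the local SSD target, matching the fail-safe described in the text.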
CONNECTIVITY REDUNDANCY
XtremIO’s connectivity maintains communications redundancy to every system component (see Tables 1 and 2).
Not only does every component have at least two paths for communication, but the management communication is also on a separate
network from the data flow. Host I/O is done via Fibre Channel or iSCSI ports, while management of the system is done via
dedicated Ethernet management ports on each Storage Controller. Such a design allows separation of control from the I/O path.
Monitoring on a different network gives the ability to correlate events and system health independent of load or I/O behavior.
Each Storage Controller has two iSCSI ports | Connect each port to a separate SAN switch
Each Storage Controller has two Infiniband ports | Each port connects to an independent Infiniband fabric to provide fault tolerance against Infiniband component failures
Two Infiniband switches (when more than one X-Brick in system) | Each switch connects to every system Storage Controller and protects against Infiniband switch failure
Two Infiniband interconnect cables (with one X-Brick in a system) | Redundant Infiniband paths between the two Storage Controllers
Each array enclosure (DAE) has two SAS controller modules | Failure of a SAS controller module does not result in loss of connectivity between the DAE and X-Brick Storage Controllers
Each array enclosure (DAE) SAS controller module utilizes two SAS cables | Redundant SAS paths ensure that SAS port or SAS cable failures do not cause service loss
Loss of power on both circuits | System performs de-stage of RAM to non-volatile storage and performs an orderly shutdown | No service until power is restored
XtremIO uses standard x86 servers, interface cards, Infiniband components, and eMLC SSDs. These components all include very
mature and robust hardware verification steps. EMC XtremIO avoids custom hardware modules in the array: Custom hardware
requires substantial engineering work to achieve the same level of resiliency that is readily available in standard enterprise-
proven components.
Fault Detection
The System Wide Management Module (SYM) continuously monitors and detects hardware and software faults in the system. It
continuously monitors Storage Controllers, Disk Array Enclosures (DAE), Fibre Channel HBAs, Ethernet NICs, Infiniband HCAs,
Infiniband Switches, and Battery Backup Units. The SYM also continuously monitors the SCSI driver, HBA controller drivers,
Linux kernel, and battery communication software components.
Every component and every data path used in the system has its own error detection method (see “End-to-End Verification”
earlier in this document). For instance, the eMLC SSDs that XtremIO uses have an LBA-seeded 32-bit CRC for ECC mis-correct
detection and on-the-fly correction. Each SSD also has 22-bit correction for every 512-byte sector and hardware-based RAID-5
within the SSD itself to protect against internal flash module failures. This protection is separate from, and in addition to,
XtremIO’s XDP technology, and provides orders of magnitude greater resiliency than is typical of consumer MLC (cMLC) SSDs.
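The idea behind seeding a CRC with the LBA can be shown with Python's standard zlib module. This sketch only illustrates the principle; it assumes the standard CRC-32 polynomial and a simple seeding scheme, neither of which is stated in this paper for the SSD firmware. The names seal and verify are hypothetical.

```python
import zlib


def seal(lba: int, sector: bytes) -> int:
    """Compute a CRC-32 over the sector, seeded with its LBA."""
    return zlib.crc32(sector, lba & 0xFFFFFFFF)


def verify(lba: int, sector: bytes, stored_crc: int) -> bool:
    """True only if the sector is intact and belongs to this LBA."""
    return seal(lba, sector) == stored_crc
```

Seeding with the address means the check fails not only when the contents are damaged, but also when intact data is returned from the wrong address.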
Fault Prevention
In any system, hardware components may have occasional faults, thus it is important to isolate defective areas and refrain from
using them. XtremIO can isolate Storage Controllers, communications ports, SSDs and can even isolate portions of SSDs. For
example, if SSD sectors suffer uncorrectable corruption for any reason, XtremIO will refrain from using the affected addresses.
This is an added level of logical prevention on top of the SSD flash hardware controller that isolates defective flash components.
Advanced Healing
As previously mentioned, the SYM will automatically restart software components upon failure and can also reallocate software
to different Storage Controllers upon hardware failures. For instance, if the SYM recognizes that the service to accept I/O from
hosts (the “R Module”) is not running, it will restart it automatically. This capability ensures utmost availability and optimized
service levels at all times.
The XtremIO array identifies unexpected data differences through the fingerprint check performed when reading from the SSD. Upon
detection of such an inconsistency, XtremIO automatically rebuilds the missing data from all possible sources. This can be as
simple as rereading the data from the SSD in case the issue is transient. If not able to read the data (or if the re-read also
produces incorrect results) the array will rebuild the data from the other SSDs in the XDP redundancy group. As explained in the
XtremIO Flash Specific Data Protection White Paper, XtremIO is able to rebuild the information even when two SSDs or data
stripes are inaccessible in the same redundancy group.
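The healing sequence (check the fingerprint, re-read, then rebuild from the redundancy group) can be sketched as follows. This Python example substitutes single XOR parity for XDP, which is a deliberate simplification: XDP tolerates two failures per redundancy group and its scheme is not reproduced here. SHA-256 and all function names are assumptions for illustration.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length blocks byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def healed_read(read_fn, expected_fp, peers, parity, retries=1):
    """Read a block; on a fingerprint mismatch, re-read, then rebuild."""
    for _ in range(1 + retries):
        block = read_fn()
        if block is not None and hashlib.sha256(block).hexdigest() == expected_fp:
            return block               # good data (possibly after a transient error)
    # Rebuild from the surviving blocks of the stripe plus its parity.
    rebuilt = xor_blocks(peers + [parity])
    if hashlib.sha256(rebuilt).hexdigest() != expected_fp:
        raise IOError("unable to recover block")
    return rebuilt
```

The rebuilt block is verified against the same fingerprint before it is returned, so a failed reconstruction is reported as an error instead of being passed to the host.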
The journaling and metadata in the system are critical for recovery from catastrophic events. Due to the importance of such
datasets the journals are protected by CRC for every written block and there are three mirrored copies of the metadata. Having
three copies not only provides extra redundancy, but also gives a majority vote scenario. For instance, in case two metadata
XtremIO OS (XIOS)
XtremIO system updates are usually limited to XIOS and only modify the executable code that runs in user space. XIOS code is
upgraded by loading the new code into resident memory on the individual Storage Controllers and instantaneously flipping all
the Storage Controllers to run the new code. There is no impact to the host application and the system is completely available
during all this time.
SYSTEM RECOVERABILITY
XtremIO shutdown and power up processes are risk-free due to the simple design of the system. There are two modes of
shutdown: Orderly and Emergency.
Orderly Shutdown
Orderly shutdowns are a graceful process, initiated upon external power loss or user request. The battery backup units
hold enough power to perform two complete orderly shutdowns. When the system comes up, it checks that the batteries hold at
least enough power to complete a graceful shutdown upon loss of power.
In an Orderly Shutdown all the Storage Controllers in the system coordinate the shutdown process (via the System Wide
Management module). The system will stop accepting new I/O requests and coordinate stoppage of all journaling activity, write
data held in memory to the SSDs, and halt the system.
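The startup battery check and the ordered shutdown steps can be sketched in Python. The dictionary-based system state, the watt-hour units, and both function names are hypothetical and serve only to make the ordering of the steps concrete.

```python
def battery_ok_at_startup(charge_wh: float, shutdown_cost_wh: float) -> bool:
    """Startup check: the batteries must cover one full graceful shutdown."""
    return charge_wh >= shutdown_cost_wh


def orderly_shutdown(system: dict) -> list[str]:
    """Apply the shutdown steps in order and return a log of what was done."""
    log = []
    system["accepting_io"] = False
    log.append("stop accepting new I/O")
    system["journaling"] = False
    log.append("stop journaling")
    system["ssd"].update(system["memory"])   # de-stage in-memory data
    system["memory"].clear()
    log.append("write memory to SSD")
    system["running"] = False
    log.append("halt")
    return log
```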
Emergency Shutdown
In the case of a catastrophic event where system communication is inadequate for any reason, each Storage Controller has the
capability to shut itself down and persistently maintain consistent data until power is restored. Each Storage Controller has two
local vault drives mirrored in a RAID-1 configuration. Upon loss of communications or power, the Storage Controller will save
two copies of all user data and system metadata. All the local memory and journal is dumped to the vault drives. When power
returns and communications get restored, the Storage Controller will reconcile its journal information with the rest of the
system.
Separate path from system entry to SSD for user data and its accompanying fingerprint
Secured journaling protecting against unexpected system shutdown, component failures, or communication failures
CONTACT US
To learn more about how EMC products, services, and solutions can help solve your business and IT challenges, contact your local
representative or authorized reseller, or visit us at www.EMC.com.
EMC2, EMC, the EMC logo, XtremIO and the XtremIO logo are registered trademarks or trademarks of EMC Corporation in the United
States and other countries. VMware is a registered trademark of VMware, Inc., in the United States and other jurisdictions.
© Copyright 2014 EMC Corporation. All rights reserved. Published in the USA. 02/14 EMC White Paper H12914
EMC believes the information in this document is accurate as of its publication date. The information is subject to change without
notice.