Abstract
This white paper describes the features and functions of EMC® CLARiiON® global hot sparing and proactive
hot sparing for CX, CX3, and CX4 series storage systems.
September 2009
Copyright © 2005, 2006, 2007, 2009 EMC Corporation. All rights reserved.
EMC believes the information in this publication is accurate as of its publication date. The information is
subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION
MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE
INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an applicable
software license.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.
All other trademarks used herein are the property of their respective owners.
Part Number C1069.5
EMC CLARiiON Global Hot Spares
and Proactive Hot Sparing
Best Practices Planning 2
Introduction
This white paper discusses best practices for implementing hot sparing, proactive hot sparing, and rebuild
logging with CLARiiON storage systems.
This white paper discusses:
• Characteristics of global hot sparing
• Characteristics of proactive hot sparing
• The global hot spare disk drive selection algorithm
• Considerations for designating global hot spare drives
• RAID types that benefit from global hot sparing and proactive hot sparing
• Probational drive support and rebuild logging
Audience
This white paper is intended for technical individuals who desire an in-depth understanding of the global
hot sparing and proactive hot sparing features available on CLARiiON CX, CX3, and CX4 series storage
systems.
Terminology
Degraded state: A RAID group, and the LUNs bound to it, are in a degraded state when one disk drive fails. LUNs in a degraded state are at serious risk of data loss should additional drives fail. RAID levels 1, 3, and
Rebuild times are affected by rebuild priority settings and drive speed. Within the CX, CX3, and CX4 series storage systems, LUNs with higher rebuild priorities rebuild first. If multiple LUNs within the same RAID group have the same priority level, they are further prioritized by size: the smallest LUNs rebuild first. If a RAID group is idle during a rebuild, all LUNs rebuild at the ASAP rate regardless of their priority settings.
A rebuild operation with an ASAP or a high priority recovers the LUN faster than one with medium or low
priority. However, a higher priority setting will have more impact on storage-system performance. If
performance of the RAID group is more important than the availability of a LUN, the rebuild priority
should be set to low for that LUN in order to minimize the rebuild’s impact on the system.
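The ordering rules above can be sketched as a simple sort: higher priority first, and within a priority level, smaller LUNs first. This is a hypothetical illustration, not FLARE code; the LUN structure and priority names are assumptions based on the text.

```python
# Hypothetical sketch of the rebuild-ordering rules; the dict layout and
# priority labels are illustrative, not FLARE internals.

def rebuild_order(luns):
    """Order LUNs for rebuild: higher priority first, then smaller LUNs
    first within the same priority level."""
    # Lower rank rebuilds sooner: ASAP before High, Medium, Low.
    rank = {"ASAP": 0, "High": 1, "Medium": 2, "Low": 3}
    return sorted(luns, key=lambda lun: (rank[lun["priority"]], lun["size_gb"]))

luns = [
    {"name": "LUN_0", "priority": "High", "size_gb": 500},
    {"name": "LUN_1", "priority": "High", "size_gb": 100},
    {"name": "LUN_2", "priority": "ASAP", "size_gb": 800},
]
print([lun["name"] for lun in rebuild_order(luns)])
# ['LUN_2', 'LUN_1', 'LUN_0']
```

The two-level sort key mirrors the text: priority is compared first, and size only breaks ties within a priority level.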
The equalization process is a disk-to-disk copy from the hot spare to the replacement drive; therefore it is
faster than the rebuild process. The measurements in Table 2 were collected on a CX4-960 with 4 Gb/s
back-end buses, rebuilding an idle LUN of maximum size (268.4 GB per drive) bound on a RAID group of
300 GB 15k rpm FC drives. When the equalize operation has completed, the hot spare becomes available
and the replacement drive operates as part of the RAID group.
Table 2. Baseline equalize times for a 300 GB Fibre Channel disk drive on an idle CX4-960¹
¹ Rebuild rate is affected by the back-end drive distribution of the RAID group.
Whether it is initiated manually or automatically, proactive hot sparing follows these rules. If there is more
than one bound hot spare, all but one of the hot spares can be used for proactive sparing. However, if only
one hot spare is bound, that hot spare is eligible for proactive sparing. Also, only one proactive spare may
be active (proactively copying) in a RAID group at any given time.
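These eligibility rules can be expressed as a short sketch. The function names and structure are hypothetical, chosen for illustration only; they do not reflect FLARE internals.

```python
# Illustrative sketch of the proactive-sparing eligibility rules;
# names are hypothetical, not FLARE code.

def proactive_spares_available(bound_hot_spares):
    """How many bound hot spares may be used for proactive sparing.

    With exactly one bound hot spare, that spare is eligible; with more
    than one, all but one may be used, so one spare remains held back
    for ordinary hot sparing."""
    n = len(bound_hot_spares)
    return n if n <= 1 else n - 1

def can_start_proactive_copy(active_copies_in_raid_group):
    """Only one proactive spare may be actively copying in a RAID group
    at any given time."""
    return active_copies_in_raid_group == 0

print(proactive_spares_available(["HS0"]))                # 1
print(proactive_spares_available(["HS0", "HS1", "HS2"]))  # 2
```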
2. Size: The selection algorithm ascertains the size of each global hot spare available on the array. The algorithm then identifies the hot spare drives that are large enough for the user LUNs² that are bound on the disk being spared. The size of the hot spare selected depends on the size of the LUNs bound on the disk being spared, not the raw capacity of the disk.
Although this optimization makes it possible for a lower-capacity hot spare to replace a drive of larger
capacity, it is not recommended that hot spares be planned in a manner that will rely on this. Once the entire
drive is bound into LUNs, the smaller hot spare will no longer be able to replace it. To adequately protect
drives now, and in the future, hot spares should have at least as much capacity as the drives they may replace.
3. Location: If multiple drives meet the criteria in step 2, FLARE will first attempt to select a hot spare
on the same back-end bus as the drive being hot spared. If FLARE cannot select an appropriate hot
spare on the same bus as the drive being spared, an appropriate hot spare will be selected from the
remaining buses. The remaining buses are prioritized in ascending order.
Note: FLARE does not use the rotational speed of the disk drives in deciding which global hot spare to use.
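The size and location criteria can be sketched as follows. This is an assumption-laden illustration, not FLARE code: the field names are invented, and the drive-type matching that precedes these steps is assumed to have already been done.

```python
# Hedged sketch of the size and location steps of the hot spare
# selection algorithm. Field names are illustrative; drive-type
# matching is assumed already applied to the candidate list.

def select_hot_spare(spares, failed_drive):
    """Pick a global hot spare for a failing drive.

    Size: the spare must hold the user LUN capacity bound on the failing
    drive, not its raw capacity. Location: prefer a spare on the same
    back-end bus; otherwise take the remaining buses in ascending order.
    Rotational speed is not considered."""
    # Keep only spares large enough for the bound LUN capacity.
    candidates = [s for s in spares
                  if s["capacity_gb"] >= failed_drive["bound_lun_gb"]]
    if not candidates:
        return None
    # Prefer a spare on the same back-end bus as the failing drive.
    same_bus = [s for s in candidates if s["bus"] == failed_drive["bus"]]
    if same_bus:
        return same_bus[0]
    # Otherwise, remaining buses are prioritized in ascending order.
    return min(candidates, key=lambda s: s["bus"])

spares = [
    {"name": "HS_A", "capacity_gb": 146, "bus": 1},
    {"name": "HS_B", "capacity_gb": 300, "bus": 2},
    {"name": "HS_C", "capacity_gb": 300, "bus": 0},
]
print(select_hot_spare(spares, {"bound_lun_gb": 268, "bus": 2})["name"])
# HS_B: large enough and on the same bus as the failing drive
```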
² A certain amount of capacity on the vault drives is dedicated to FLARE. There may also be user LUNs on these five drives. If one of these drives fails, only the user LUNs will be rebuilt onto a hot spare. Other protection mechanisms are already in place for the system data. For example, the storage system's current configuration is protected by a triple mirror.
Different rotational speeds in different RAID groups
If a storage system's RAID groups are formed from drives with the same capacities but with different
rotational speeds, it is best to allocate the fastest drives as the global hot spares. For example, if there are
RAID groups formed from 10k rpm 300 GB drives and other RAID groups formed from 15k rpm 300 GB
drives, the 15k rpm drives should be allocated as hot spares. Because the rotational speed of drives is not a factor in the selection of a hot spare, it is a good idea to allocate global hot spares that are as fast as or faster than the drives they may be required to replace, so that once the hot spare has rebuilt into the RAID group, the performance characteristics of the RAID group are maintained.
Although the hot spare selection algorithm does not take drive rotational speed into account, the hot spare
selection algorithm prefers the bus of the failing drive. Therefore if buses are divided based on rotational
speed, the algorithm chooses a hot spare of the same rotational speed. For example, one back-end bus is all
10k rpm drives, another back-end bus is all 15k rpm drives, and hot spares are allocated on each. If a 10k
rpm drive fails, a 10k rpm hot spare will be selected first because it is on the same bus as the failing drive.
Additional considerations
Special considerations apply when the array has a mixture of EFD, FC, SAS, and SATA drives. Fibre Channel and SATA-II drives can hot spare for other Fibre Channel and SATA-II drives³. Only an EFD can be a hot spare for other EFDs. Furthermore, only an ATA drive can be a hot spare for other ATA drives.
FC/SATA-II drives may not hot spare for EFDs (or ATA drives) and vice versa. Therefore, in a storage
system with mixed drive technologies, the best practice is to have one global hot spare for every 30 drives
of each type. For example, in a system with 15 FC drives and 15 EFDs, two global hot spares should be
allocated, one FC and one EFD.
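The one-spare-per-30-drives guideline can be computed per compatibility class. The sketch below is illustrative; the class names are invented labels, and the grouping follows the compatibility rules stated above (FC and SATA-II spare for each other, EFD and ATA each stand alone).

```python
# Illustrative calculation of the one-spare-per-30-drives guideline for
# mixed drive technologies. Class labels are invented for this sketch.
import math

# Drives that can spare for each other share a class: FC and SATA-II
# (including LCFC) are mutually compatible; EFD and ATA stand alone.
SPARE_CLASS = {"FC": "FC/SATA-II", "SATA-II": "FC/SATA-II",
               "EFD": "EFD", "ATA": "ATA"}

def hot_spares_needed(drive_counts):
    """Return the recommended hot spares per compatibility class:
    one spare for every 30 drives of each type, rounded up."""
    totals = {}
    for drive_type, count in drive_counts.items():
        cls = SPARE_CLASS[drive_type]
        totals[cls] = totals.get(cls, 0) + count
    return {cls: math.ceil(n / 30) for cls, n in totals.items()}

print(hot_spares_needed({"FC": 15, "EFD": 15}))
# {'FC/SATA-II': 1, 'EFD': 1} -> two spares, one FC and one EFD
```

This reproduces the example in the text: 15 FC drives and 15 EFDs call for two global hot spares, one of each technology, because neither type can spare for the other.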
A drive cannot be designated as a global hot spare if it is a vault drive. The vault drives reside in the first
five drive slots in the first DAE of all CX, CX3, and CX4 series arrays. Other than this restriction, a global
hot spare disk drive can be located on any back-end bus and in any enclosure.
FLARE does not provide the option to reserve specific hot spare drives for use with specific RAID groups
or drives. The best practice for maintaining the full functionality of CLARiiON systems is to implement an
effective monitoring and service program. Global hot spares play an important role in maintaining full
system functionality, but their main benefit is that they minimize the amount of time that RAID groups run
in a degraded mode. Regardless of the hot spare configuration, prompt replacement of failed drives is
important for the preservation of data integrity. Global hot spares are not a replacement for a
comprehensive monitoring and servicing program.
³ Low-Cost Fibre Channel (LCFC) drives are considered Fibre Channel drives.
Finally, global hot spare drives do not eliminate the performance impact of rebuilding RAID groups. When
a drive in a RAID group fails and is replaced—either by a manual replacement or by an automatic global
hot spare replacement operation—CLARiiON processing power and I/O bandwidth are necessary to
recalculate the lost data from parity information and rebuild it to the hot spare. Proactive hot sparing
significantly reduces this overhead by performing a proactive copy rather than a rebuild. Once the rebuild
to the hot spare is complete and the failed drive is physically replaced, FLARE equalizes the data from the
hot spare to the replacement drive. This is merely a copy, so I/O is generated but very little processing
power is required.
Hot sparing and proactive hot sparing are not available for RAID 0 (data striping without parity) groups and
individual disk drives because they do not have redundancy to support data rebuilds.
Occasionally, otherwise healthy drives⁴ will appear to be offline while they complete a sector remap or other similar internal operation. Although this is normal, it can take several minutes. During this window, the drive is powered on but unresponsive. Once the internal operation is complete, the drive resumes normal operation and responsiveness.
⁴ This behavior is seen more often with SATA-II drives than FC drives.
Without rebuild logging, all I/O to the unavailable drive fails and FLARE quickly marks the drive as
“failed” and in need of a rebuild. Regardless of whether this unavailability is permanent or temporary,
manual intervention is required to replace the drive. Once the drive is replaced, a full rebuild is performed.
A full rebuild can be quite time-consuming for large drives. Furthermore, during the rebuild, the RAID
group may not have data redundancy.
Rebuild logging greatly reduces rebuild times and eliminates the need for manual intervention on
temporarily unavailable drives. When I/O to a drive fails due to a time-out error, the drive is considered for
probational status. Probational drives will delay a hot spare from swapping in. If the probational drive is in
a redundant RAID group, a rebuild log is created. This log tracks the sectors of the drive that should have
changed while the drive was unavailable. The accessibility of the drive is tested every 30 seconds for
approximately five minutes. Drives that become accessible within this time are brought back online. The
rebuild log is then used to rebuild only those sectors that changed while the drive was offline. Drives that
do not become accessible within the time limit are marked as “failed” and require a full rebuild.
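The probation timing described above can be modeled schematically: the drive is probed every 30 seconds, and if it does not respond within roughly five minutes it is failed. The sketch below assumes an exact five-minute limit for simplicity; it is an illustration of the described behavior, not FLARE code.

```python
# Schematic model of the probational-drive timing: poll every 30 seconds
# for (assumed exactly) five minutes. Illustrative only, not FLARE code.

PROBE_INTERVAL_S = 30
PROBATION_LIMIT_S = 5 * 60

def probation_outcome(seconds_until_accessible):
    """Return what happens to a probational drive in a redundant RAID
    group that becomes accessible after the given number of seconds."""
    elapsed = 0
    while elapsed < PROBATION_LIMIT_S:
        elapsed += PROBE_INTERVAL_S
        if seconds_until_accessible <= elapsed:
            # Drive came back in time: rebuild only the logged sectors.
            return "partial rebuild from log"
    # Time limit expired: drive is marked failed and fully rebuilt.
    return "marked failed, full rebuild"

print(probation_outcome(90))   # partial rebuild from log
print(probation_outcome(600))  # marked failed, full rebuild
```

The payoff of the log is visible in the first case: only the sectors that changed during those 90 seconds need to be rebuilt, instead of the whole drive.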
Some rebuild logging functionality can be extended to nonredundant RAID groups. A drive from a
nonredundant group may be considered for probational status. However, changes will not be tracked. If
the drive becomes available again within the time limit, it will be brought back online. Drives that do not become accessible within the time limit are marked as "failed" and must be manually replaced⁵.
Physically removing a drive will cancel the probational state of the drive and force a full rebuild to a hot
spare. Do not physically remove drives in a probational state.
Conclusion
When a RAID group operates in a degraded mode, it heightens the risk of data loss. Hot sparing, proactive
hot sparing, and rebuild logging work together to minimize this risk. Hot sparing rebuilds failed drives to
global hot spare disk drives, restoring high availability to the RAID group. Proactive hot sparing takes this
functionality one step further; by identifying failing drives before they fail, proactive hot sparing has the
potential to drastically reduce the time a RAID group stays in a degraded state. Rebuild logging allows a
drive in a redundant RAID group to be offline for a period of time while write I/O to the drive is logged.
Hot sparing, proactive hot sparing, and rebuild logging come standard with FLARE. EMC strongly
recommends that CLARiiON storage systems be configured with global hot spare disk drives to take full
advantage of the enhanced data integrity and high availability that hot sparing and proactive hot sparing
offer.
⁵ It is recommended to protect RAID groups through either parity or mirroring.
Appendix: Frequently asked questions (FAQ)
Global hot sparing
Following are some frequently asked questions regarding CLARiiON global hot spares:
• Question: Can a specific hot spare disk be assigned to a specific RAID group?
Answer: No, CLARiiON hot spares are global. They will be assigned appropriately to a RAID group
if a drive should fail.
• Question: For a condition in which the hot spare disk specs do not match the RAID group disk specs,
what are the performance issues?
Answer: Using a hot spare with lower performance characteristics than the failed drive impacts the
RAID group’s performance more if the RAID group is disk bound. If the RAID group is not disk
bound, especially if I/Os are cached, using a hot spare with lower performance characteristics has a
minimal impact on the RAID group’s performance. In cases where performance is critical, designate
disks with the highest rotational speed as global hot spares. In any case, the impact of the rebuild is
more significant than the performance penalty for the equalization process.
• Question: How are hot spare disk drives selected when a disk failure occurs?
Answer: Hot spare disk drives are selected by type (FC/SATA-II, EFD, or ATA), size, and bus
number. The “Global hot spare drive selection algorithm” section has more information.
• Question: How many hot spare drives should be allocated?
Answer: One hot spare drive should be allocated for every 30 drives of each type (FC/SATA-II, EFD,
or ATA). For example, in a system with 15 FC drives and 15 EFDs, two global hot spares should be
allocated, one FC and one EFD. In larger systems, hot spare drives should be evenly distributed across
back-end loops.
• Question: How do metaLUNs affect hot spare rebuilds?
Answer: MetaLUNs do not affect hot spare rebuilds. The rebuilds still occur at the individual LUN
(metaLUN’s component LUN) level.
• Question: What happens when the proactive spare drive fails during proactive copy?
Answer: When the proactive spare fails during a proactive copy operation, the proactive spare is
swapped out. A new request to proactively hot spare will be required to initiate another proactive copy
operation. A message will be logged to the event log alerting the user that the proactive hot sparing
failed.
• Question: What happens when the proactive candidate drive fails during proactive copy?
Answer: When a proactive candidate fails during the proactive copy operation, the proactive hot sparing process transitions into hot sparing. Upon the failure of the proactive candidate, all units are marked as "needing a rebuild," and a checkpoint is updated for each partition that needs a rebuild. The checkpoints record how far the proactive copy progressed, so the entire hot spare does not have to be rebuilt.
• Question: What happens when another drive fails in the same RAID group as the proactive candidate
during the proactive copy?
Answer: When a drive fails in a RAID group that is proactively copying, the LUN enters a degraded
state. The proactive copy to the proactive spare will continue to completion. If faulting the proactive
candidate will cause the RAID group to shut down, then the proactive candidate is left alive and the
proactive copy percentage remains 100 percent. Hot sparing is activated on the failed drive. Once the
fault in the RAID group is fixed and all data rebuilt, the proactive candidate will be faulted.