
EMC CLARiiON Global Hot Spares and Proactive Hot Sparing

Best Practices Planning

Abstract
This white paper describes the features and functions of EMC® CLARiiON® global hot sparing and proactive
hot sparing for CX, CX3, and CX4 series storage systems.

September 2009
Copyright © 2005, 2006, 2007, 2009 EMC Corporation. All rights reserved.
EMC believes the information in this publication is accurate as of its publication date. The information is
subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION
MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE
INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an applicable
software license.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.
All other trademarks used herein are the property of their respective owners.
Part Number C1069.5
Table of Contents

Executive summary
Introduction
    Audience
    Terminology
Characteristics of global hot sparing
    Rebuild and equalization time considerations
Characteristics of proactive hot sparing
    Proactive copy and equalization time considerations
Global hot spare drive selection algorithm
Considerations for designating global hot spare drives
    Different drive capacities in different RAID groups
    Different rotational speeds in different RAID groups
    Mix of rotational speeds and sizes in different RAID groups
    Mix of back-end bus speeds
    Additional considerations
RAID types that benefit from hot sparing and proactive hot sparing
Probational drives and rebuild logging
Conclusion
Appendix: Frequently asked questions (FAQ)
    Global hot sparing
    Proactive hot sparing

Executive summary
To minimize the risks posed by failed disk drives, EMC has developed global hot sparing technology.
Global hot sparing provides automatic, online rebuilds of redundant RAID groups when any of the group’s
disk drives fail. EMC has further advanced this functionality with the introduction of proactive hot sparing.
Proactive hot sparing recognizes when a disk is nearing failure and preemptively copies the disk’s content
prior to failure. The combination of these features minimizes each RAID group’s vulnerability to
additional drive failures, which could result in data loss. This technique also avoids the increased load of a
parity rebuild. Both technologies require global hot spare disk drives to be available and functioning
properly. EMC furthers these solutions with the introduction of rebuild logging, which allows a drive in a redundant RAID group to be offline for a period of time while write I/O to that drive is logged.
When one drive from a RAID group fails, that RAID group is in a degraded state. EMC® CLARiiON®
global hot sparing technology minimizes the time that RAID groups are in such a state. Multiple drive
failures in a RAID group can cause data loss, so it is advantageous to replace any failed or failing drives as
quickly as possible. CLARiiON’s proactive hot sparing technology provides further protection by
preemptively detecting failing drives and replacing them before the RAID group enters a degraded state.
With global hot sparing, FLARE®—CLARiiON core operating code—automatically swaps failed drives
with designated global hot spare drives, without the need for manual intervention. Proactive hot sparing
takes this functionality one step further. With proactive hot sparing, FLARE automatically recognizes
when drives show signs of imminent failure and copies the data from the failing drive to a global hot spare
drive, without the need for manual intervention. Proactive hot sparing minimizes a RAID group’s exposure
to a degraded state. Furthermore, this method avoids the disk-intensive rebuild process for RAID groups
that use parity. Although CLARiiON systems do not require the use of global hot spare drives for normal
operation, their inclusion greatly improves availability. This configuration enhances data integrity and high
availability by minimizing the exposure of RAID groups to concurrent drive failures within a storage
system.

Introduction
This white paper discusses best practices for implementing hot sparing, proactive hot sparing, and rebuild
logging with CLARiiON storage systems.
This white paper discusses:
• Characteristics of global hot sparing
• Characteristics of proactive hot sparing
• The global hot spare disk drive selection algorithm
• Considerations for designating global hot spare drives
• RAID types that benefit from global hot sparing and proactive hot sparing
• Probational drive support and rebuild logging

Audience
This white paper is intended for technical individuals who desire an in-depth understanding of the global
hot sparing and proactive hot sparing features available on CLARiiON CX, CX3, and CX4 series storage
systems.

Terminology
Degraded state: A RAID group, and the LUNs bound to it, enters a degraded state when one of the group's disk drives fails. LUNs in a degraded state are at serious risk of data loss should additional drives fail. RAID levels 1, 3, and

5 can withstand a single drive failure. RAID 6 can withstand two drive failures. RAID 1/0 can withstand a
single failure per mirrored pair.
Equalize: The process of copying data from a hot spare drive to the replacement drive within the degraded
RAID group.
Global: Available to the entire CLARiiON storage system.
High availability: A system is said to be highly available when it has no single points of failure.
Hot spare drive: Any disk drive that has been allocated as a hot spare. Hot spare drives are always
globally available.
Proactive candidate: A drive that exhibits symptoms of imminent failure and is being proactively hot spared.
Proactive copy: The process of moving the data from the proactive candidate to the proactive spare.
Proactive spare: A spare drive that is the destination of the proactive candidate’s data.
Rebuild: The process of reading data and parity from the surviving disks of a RAID 3, 5, or 6 group;
reconstructing missing data from that information; and writing the missing data to a global hot spare or a
replacement drive.

Characteristics of global hot sparing


The CLARiiON global hot sparing process is as follows:
1. Global hot sparing is invoked by one of three conditions:
   a. A manually initiated proactive copy
   b. A FLARE-initiated (automatic) proactive copy
   c. A drive failure or removal
2. An appropriate hot spare disk drive is chosen by FLARE, based on the algorithm outlined in the
“Global hot spare drive selection algorithm” section.
3. For RAID 3, 5, and 6 groups, the data from the failed drive is rebuilt from parity onto the hot spare
disk drive. For RAID 1/0 and RAID 1 groups, the data is copied from the surviving mirror to the hot
spare. Once the rebuild is complete, high availability is restored to the RAID group. Once the data
rebuild to a global hot spare disk drive starts, the rebuild continues to completion—even if the failed
drive is manually replaced before the rebuild to the hot spare completes.
4. Once data is completely rebuilt or copied onto the hot spare, and the failed drive is manually replaced,
FLARE equalizes (copies in the background) data from the hot spare to the replacement drive.
Equalization is a copy operation; the data has already been rebuilt to the hot spare.
5. Once equalization is complete, the RAID group is back in its normal highly available state, and the hot
spare is again globally available to replace other drives as needed.
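The five steps above can be sketched as a small state flow. The following is an illustrative Python model, not FLARE code; the function names and phase strings are invented for clarity:

```python
# Illustrative model of the global hot sparing flow described above.
# Names ("recovery_method", phase strings) are hypothetical.

MIRRORED = {"RAID 1", "RAID 1/0"}
PARITY = {"RAID 3", "RAID 5", "RAID 6"}

def recovery_method(raid_type):
    """How data reaches the hot spare in step 3 for a given RAID type."""
    if raid_type in MIRRORED:
        return "copy from surviving mirror"
    if raid_type in PARITY:
        return "rebuild from parity"
    raise ValueError("no redundancy; hot sparing does not apply")

def sparing_flow(raid_type):
    """Yield the phases of the hot sparing process, in order."""
    yield "invoke sparing (failure, removal, or proactive copy)"
    yield "select hot spare (see selection algorithm)"
    # Step 3 runs to completion even if the failed drive is replaced early.
    yield recovery_method(raid_type)
    yield "equalize hot spare to replacement drive"
    yield "hot spare returns to global pool"
```

Note the branch in step 3: mirrored groups copy from the surviving mirror, while parity groups must reconstruct the missing data.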

Rebuild and equalization time considerations


The use of global hot sparing involves RAID group disk rebuilds and equalizations. Therefore, a discussion
of rebuild and equalization times is appropriate here.
Rebuild times depend on the following:
• Drive capacity
• Drive type (Enterprise Flash Drive, Fibre Channel, SAS, or SATA)
• User space on the drive bound into LUNs
• Rebuild priority
• Background I/O workload

• RAID type
• Number of drives in the RAID group (parity groups only)
• Distribution of drives over multiple Fibre Channel back-end loops (as applicable)
The measurements in Table 1 were collected on a CX4-960 with 4-Gb/s back-end buses, rebuilding an idle
LUN of maximum size (268.4 GB per drive) bound on a RAID group of 300 GB 15k rpm FC drives. The
rebuild times represent the times needed to completely rebuild the data from the failed disk onto the global
hot spare. Actual rebuild times on an active system may vary.
Table 1. Baseline rebuild times for a 300 GB Fibre Channel disk drive on an idle CX4-960, ASAP priority [1]

Type             Rebuild rate
RAID 5 (4+1)     63 – 104 MB/s
RAID 6 (6+2)     63 – 99 MB/s
RAID 1/0 (3+3)   104 MB/s

Rebuild times are affected by rebuild priority settings and drive speed. Within the CX, CX3, and CX4
series storage systems, LUNs with higher rebuild priorities rebuild first. If multiple LUNs within the same
RAID group have the same priority level, they are further prioritized by size. The smallest LUNs will
rebuild first. If a RAID group is idle during a rebuild, all LUNs will rebuild ASAP regardless of their
priority.
A rebuild operation with an ASAP or a high priority recovers the LUN faster than one with medium or low
priority. However, a higher priority setting will have more impact on storage-system performance. If
performance of the RAID group is more important than the availability of a LUN, the rebuild priority
should be set to low for that LUN in order to minimize the rebuild’s impact on the system.
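As a rough worked example, the ASAP rebuild rates in Table 1 translate into wall-clock times as follows (a hypothetical sketch; it assumes 1 GB = 1000 MB and an idle system):

```python
def rebuild_minutes(bound_gb, rate_mb_s):
    """Rough rebuild duration: bound capacity divided by rebuild rate."""
    return bound_gb * 1000 / rate_mb_s / 60  # GB -> MB, then seconds -> minutes

# Table 1 configuration: 268.4 GB bound per drive.
fastest = rebuild_minutes(268.4, 104)  # top of the RAID 5 range, ~43 minutes
slowest = rebuild_minutes(268.4, 63)   # bottom of the range, ~71 minutes
```

Even in the best case on an idle system, a single-drive rebuild takes on the order of three-quarters of an hour, which is why minimizing the degraded window matters.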
The equalization process is a disk-to-disk copy from the hot spare to the replacement drive; therefore it is faster than the rebuild process. The measurements in Table 2 were collected on a CX4-960 with 4-Gb/s back-end buses, equalizing an idle LUN of maximum size (268.4 GB per drive) bound on a RAID group of 300 GB 15k rpm FC drives. When the equalize operation has completed, the hot spare becomes available and the replacement drive operates as part of the RAID group.

Table 2. Baseline equalize times for a 300 GB Fibre Channel disk drive on an idle CX4-960

Type             Equalize rate
RAID 5 (4+1)     104 MB/s
RAID 6 (6+2)     104 MB/s
RAID 1/0 (3+3)   104 MB/s

Characteristics of proactive hot sparing


Proactive hot sparing is a feature available on CLARiiON storage systems running FLARE release 24 and
later. By proactively copying data from the failing drive before it is taken out of service, the RAID group
is never exposed to a situation where additional drive failures would cause data loss. Also, by performing a
copy operation rather than a rebuild, the RAID group is not exposed to latent media defects on other drives
in the RAID group. These defects could cause the sector rebuild to fail. If a sector cannot be read from the
proactive candidate, the copy process can rebuild just that sector from redundant RAID information.
Proactive hot sparing can be initiated either automatically or manually. FLARE handles the automatic
version transparently. The manual version of the feature is available in Navisphere® Manager. To initiate
proactive hot sparing, right-click the disk icon of the proactive candidate and select Copy to hot spare.

[1] Rebuild rate is affected by the back-end drive distribution of the RAID group.
Whether it is initiated manually or automatically, proactive hot sparing follows these rules. If there is more
than one bound hot spare, all but one of the hot spares can be used for proactive sparing. However, if only
one hot spare is bound, that hot spare is eligible for proactive sparing. Also, only one proactive spare may
be active (proactively copying) in a RAID group at any given time.
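The eligibility rule above can be expressed as a tiny helper (a hypothetical sketch; the function name, and the behavior when no spares are bound, are assumptions):

```python
def proactive_eligible(bound_spares):
    """Hot spares usable for proactive copies under the rules above.

    With several spares bound, one is held back for outright failures;
    with exactly one spare bound, that spare may still be used.
    """
    if bound_spares <= 0:
        return 0  # assumption: no bound spares, no proactive sparing
    return 1 if bound_spares == 1 else bound_spares - 1
```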

The CLARiiON proactive hot sparing process is as follows:


1. When a drive reaches a certain error threshold or displays certain fault behavior(s), or proactive
copying has been manually triggered, FLARE marks it for proactive hot sparing and the drive is
designated as a proactive candidate.
2. An appropriate hot spare drive is chosen by FLARE based on the algorithm outlined in the “Global hot
spare drive selection algorithm” section.
3. The data from the proactive candidate is proactively copied to the proactive spare. Checkpoints are set
throughout the proactive copy. In the event that the proactive candidate fails during the proactive
copy, data after the last checkpoint is rebuilt. Data before the checkpoint has already been copied to
the proactive spare and does not need to be rebuilt.
4. Once this copy is complete, the proactive candidate is marked as faulted and may be safely replaced.
5. Once the faulted drive is manually replaced, FLARE equalizes data from the proactive spare to the
replacement drive.
6. Once equalization is complete, the RAID group is back to its normal state and the proactive spare is
again globally available to replace other drives as needed.
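The checkpoint behavior in step 3 can be sketched as follows. This is an illustrative model with an invented checkpoint interval; FLARE's actual checkpoint granularity is not documented here:

```python
def sectors_to_rebuild(total, checkpoint_interval, copied):
    """Sectors needing a parity rebuild if the candidate fails mid-copy.

    Data before the last checkpoint is already on the proactive spare;
    only data after that checkpoint must be rebuilt (step 3 above).
    """
    last_checkpoint = (copied // checkpoint_interval) * checkpoint_interval
    return total - last_checkpoint

# A failure 570 sectors into a 1,000-sector copy, checkpointing every
# 100 sectors, leaves only the sectors past the last checkpoint (500)
# to rebuild, rather than all 1,000.
```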

Proactive copy and equalization time considerations


The use of proactive hot sparing involves proactive copy and equalization operations. Proactive copy times
depend on the following:
• Drive capacity
• Drive type (Enterprise Flash Drive, Fibre Channel, SAS, or SATA)
• User space on the drive bound into LUNs
• Background I/O workload
The measurements in Table 3 were collected on a CX4-960 system from an idle LUN of maximum size
(268.4 GB per drive) bound on a RAID group of 300 GB 15k rpm FC drives. The proactive copy times
represent the times required to completely copy the data from the proactive candidate to the proactive spare
on an idle system. Actual proactive copy times on an active system may vary. The proactive sparing
process follows the same FLARE code path as equalizing, and therefore idle proactive sparing times are
expected to be the same as the equalize times in Table 2.
Table 3. Baseline proactive copy times for a 300 GB Fibre Channel disk drive on an idle CX4-960

Type             Proactive copy rate
RAID 5 (4+1)     104 MB/s
RAID 6 (6+2)     104 MB/s
RAID 1/0 (3+3)   104 MB/s

Global hot spare drive selection algorithm


When a drive is failed by FLARE or chosen as a proactive candidate, a selection algorithm goes through
the following steps to decide which hot spare disk to use as the rebuild or copy target:

1. Drive type: CLARiiON storage systems typically use Fibre Channel, Enterprise Flash Drive (EFD),
SAS, or SATA drives. Some legacy systems may have Advanced Technology-Attached (ATA) drives.
In this step, FLARE determines the type of drive being spared and which drive types make acceptable
spares. Fibre Channel and SATA-II drives can be hot spares for other Fibre Channel and SATA-II
drives. Only an ATA drive can be a hot spare for other ATA drives. Only an EFD can be a hot spare
for an EFD.

2. Size: The selection algorithm ascertains the size of each global hot spare available on the array. The
algorithm then identifies the hot spare drives that are large enough for the user LUNs [2] that are bound
on the disk being spared. The size of the hot spare selected depends on the size of the LUNs bound on
the disk being spared, not the raw capacity of the disk.

Although this optimization makes it possible for a lower-capacity hot spare to replace a drive of larger
capacity, it is not recommended that hot spares be planned in a manner that will rely on this. Once the entire
drive is bound into LUNs, the smaller hot spare will no longer be able to replace it. To adequately protect
drives now, and in the future, hot spares should have at least as much capacity as the drives they may replace.

3. Location: If multiple drives meet the criteria in step 2, FLARE will first attempt to select a hot spare
on the same back-end bus as the drive being hot spared. If FLARE cannot select an appropriate hot
spare on the same bus as the drive being spared, an appropriate hot spare will be selected from the
remaining buses. The remaining buses are prioritized in ascending order.

Note: FLARE does not use the rotational speed of the disk drives in deciding which global hot spare to use.
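The three selection steps can be sketched in Python. The compatibility table follows the rules in step 1; the field names and the exact tie-breaking order (same bus first, then lowest bus number, then smallest sufficient capacity) are assumptions for illustration, not FLARE internals:

```python
# Step 1: which drive types make acceptable spares for each type.
COMPATIBLE = {
    "FC": {"FC", "SATA-II"},
    "SATA-II": {"FC", "SATA-II"},
    "ATA": {"ATA"},
    "EFD": {"EFD"},
}

def select_hot_spare(failed, spares):
    """Pick a spare for `failed`; both are dicts with hypothetical keys
    "type", "size_gb"/"bound_gb", and "bus". Returns None if no fit."""
    # Step 1: drive type must be an acceptable match.
    ok = [s for s in spares if s["type"] in COMPATIBLE[failed["type"]]]
    # Step 2: spare must hold the bound LUN capacity (not raw capacity).
    ok = [s for s in ok if s["size_gb"] >= failed["bound_gb"]]
    if not ok:
        return None
    # Step 3: prefer the failed drive's bus, then lower-numbered buses;
    # break remaining ties by smallest sufficient capacity (best fit).
    ok.sort(key=lambda s: (s["bus"] != failed["bus"], s["bus"], s["size_gb"]))
    return ok[0]
```

For example, for a failed FC drive on bus 1 with 268 GB bound, an FC spare on bus 1 would win over an equally large SATA-II spare on bus 2, and an EFD spare would never be considered.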

Considerations for designating global hot spare drives


The use of global hot spare disk drives automatically minimizes the exposure time of a RAID group with a
failed or failing drive. EMC recommends that the user configure CLARiiON systems with one global hot
spare for every 30 RAID-protected disk drives (RAID 1, 1/0, 3, 5, and 6 only). Hot spares should be
distributed across the storage system’s back-end buses to provide replacements for the provisioned drives.
If the storage system is located remotely enough to make prompt service difficult, a higher ratio of global
hot spare drives to RAID drives may be preferable. Regardless of the hot spare configuration, prompt
replacement of failed drives is necessary to prevent data loss.
The storage administrator is responsible for allocating global hot spare disk drives. This is done using either
the Navisphere CLI or Navisphere Manager. When a storage system has a mix of RAID groups with
different capacities and rotational speeds, the allocation of global hot spares should be considered carefully.
Choosing specific hot spare capacities and rotational speeds can make a difference in the effectiveness of
the global hot spare plan.

Different drive capacities in different RAID groups


If RAID groups within a CLARiiON storage system are formed from drives with different capacities, it is
best to allocate global hot spares to match the larger drives. For example, if there are RAID groups formed
from 146 GB drives and other RAID groups formed from 300 GB drives, it would be best to allocate 300
GB drives as the global hot spares. Alternatively, a mix of 146 GB and 300 GB drive hot spares could be
deployed, and FLARE would select the appropriate hot spare to match the used capacity of the drive being
spared.

[2] A certain amount of capacity on the vault drives is dedicated to FLARE. There may also be user LUNs on these five drives. If one of these drives fails, only the user LUNs will be rebuilt onto a hot spare. Other protection mechanisms are already in place for the system data. For example, the storage system's current configuration is protected by a triple mirror.
Different rotational speeds in different RAID groups
If a storage system's RAID groups are formed from drives with the same capacities but with different
rotational speeds, it is best to allocate the fastest drives as the global hot spares. For example, if there are
RAID groups formed from 10k rpm 300 GB drives and other RAID groups formed from 15k rpm 300 GB
drives, the 15k rpm drives should be allocated as hot spares. The rotational speed of drives is not a factor in
the selection of a hot spare. It is a good idea to allocate global hot spares that are as fast or faster than the
drives they may be required to replace, so that once the hot spare has rebuilt into the RAID group, the
performance characteristics of the RAID group will be maintained.
Although the hot spare selection algorithm does not take drive rotational speed into account, the hot spare
selection algorithm prefers the bus of the failing drive. Therefore if buses are divided based on rotational
speed, the algorithm chooses a hot spare of the same rotational speed. For example, one back-end bus is all
10k rpm drives, another back-end bus is all 15k rpm drives, and hot spares are allocated on each. If a 10k
rpm drive fails, a 10k rpm hot spare will be selected first because it is on the same bus as the failing drive.

Mix of rotational speeds and sizes in different RAID groups


If there is a mix of RAID groups formed from drives with different capacities and different rotational
speeds, it may be best to allocate hot spares of varying speeds and capacities. For example, if there are
different RAID groups formed from 10k rpm 300 GB drives, 15k rpm 300 GB drives, and 10k rpm 400
GB drives, it is best to allocate both 15k rpm 300 GB drives and 10k rpm 400 GB drives for use as global
hot spares. In cases where a 300 GB drive fails, FLARE selects the best-fit global hot spare; in this case
that is the 15k rpm 300 GB drive. With this configuration, the performance characteristics of the RAID
group are maintained, regardless of whether the failed drive was from a 10k or 15k rpm RAID group.

Mix of back-end bus speeds


CX3 and CX4 series back-end buses operate at either 4 Gb/s or 2 Gb/s. FLARE does not use bus speed in
hot spare drive selection. However, FLARE does select a drive on the same bus as the failed drive if one is
available. Therefore, you should spread hot spare drives across the back-end loops. This recommendation
applies to all CLARiiON systems but is especially important for systems running one back-end loop at 4
Gb/s and another at 2 Gb/s.

Additional considerations
Special considerations apply when the array has a mixture of EFD, FC, SAS, and SATA drives. Fibre
Channel and SATA-II drives can hot spare for other Fibre Channel and SATA-II drives [3]. Only an EFD can
be a hot spare for other EFDs. Furthermore, only an ATA drive can be a hot spare for other ATA drives.
FC/SATA-II drives may not hot spare for EFDs (or ATA drives) and vice versa. Therefore, in a storage
system with mixed drive technologies, the best practice is to have one global hot spare for every 30 drives
of each type. For example, in a system with 15 FC drives and 15 EFDs, two global hot spares should be
allocated, one FC and one EFD.
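The one-spare-per-30-drives-of-each-type guideline reduces to a small calculation (a sketch; the function name is hypothetical):

```python
import math

def recommended_spares(drive_counts):
    """One hot spare per 30 RAID-protected drives of each technology."""
    return {t: math.ceil(n / 30) for t, n in drive_counts.items() if n > 0}

# Example from the text: 15 FC drives and 15 EFDs -> one spare of each.
plan = recommended_spares({"FC": 15, "EFD": 15})
```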
A drive cannot be designated as a global hot spare if it is a vault drive. The vault drives reside in the first
five drive slots in the first DAE of all CX, CX3, and CX4 series arrays. Other than this restriction, a global
hot spare disk drive can be located on any back-end bus and in any enclosure.
FLARE does not provide the option to reserve specific hot spare drives for use with specific RAID groups
or drives. The best practice for maintaining the full functionality of CLARiiON systems is to implement an
effective monitoring and service program. Global hot spares play an important role in maintaining full
system functionality, but their main benefit is that they minimize the amount of time that RAID groups run
in a degraded mode. Regardless of the hot spare configuration, prompt replacement of failed drives is
important for the preservation of data integrity. Global hot spares are not a replacement for a
comprehensive monitoring and servicing program.

[3] Low-Cost Fibre Channel (LCFC) drives are considered Fibre Channel drives.
Finally, global hot spare drives do not eliminate the performance impact of rebuilding RAID groups. When
a drive in a RAID group fails and is replaced—either by a manual replacement or by an automatic global
hot spare replacement operation—CLARiiON processing power and I/O bandwidth are necessary to
recalculate the lost data from parity information and rebuild it to the hot spare. Proactive hot sparing
significantly reduces this overhead by performing a proactive copy rather than a rebuild. Once the rebuild
to the hot spare is complete and the failed drive is physically replaced, FLARE equalizes the data from the
hot spare to the replacement drive. This is merely a copy, so I/O is generated but very little processing
power is required.

RAID types that benefit from hot sparing and proactive hot sparing
Global hot sparing and proactive hot sparing can be used with—and enhance the data integrity and high
availability of—the following protected RAID types:
• RAID 1 (data mirroring between two disks)
If either drive from a RAID 1 mirrored pair fails, an available global hot spare disk drive is used to
replace the failed drive and data is copied from the surviving mirror. Note that it is unnecessary to
rebuild data from parity; only a copy operation is needed to rebuild data to the hot spare.
• RAID 1/0 (data striping with mirroring)
If any drive from a RAID 1/0 group fails, an available global hot spare disk drive is used to replace the
failed drive and data is copied from the surviving mirror. Note that it is unnecessary to rebuild data
from parity; only a copy operation is needed to rebuild data to the hot spare. RAID 1/0 groups can
survive multiple drive failures without data loss if the failed drives are from different mirrored pairs.
• RAID 3 (data striping with dedicated parity disk)
If any drive from a RAID 3 group fails, an available global hot spare disk drive is used to replace the
failed drive and rebuild the RAID group.
• RAID 5 (data striping with parity spread across all drives)
If any drive from a RAID 5 group fails, an available global hot spare disk drive is used to replace the
failed drive and rebuild the RAID group.
• RAID 6 (data striping with dual parity spread across all drives)
If any drive from a RAID 6 group fails, an available global hot spare disk drive is used to replace the
failed drive and rebuild the RAID group. With RAID 6, up to two failed drives may be rebuilt
simultaneously while keeping the RAID group’s LUNs online.

Hot sparing and proactive hot sparing are not available for RAID 0 (data striping without parity) groups and
individual disk drives because they do not have redundancy to support data rebuilds.

Probational drives and rebuild logging


Rebuild logging is a feature available on systems running FLARE release 24 and later. This new
functionality allows for a drive in a redundant RAID group to be offline for a period of time while write I/O
to this drive is logged. If the drive becomes accessible again within a set time limit, the rebuild log will be
used to do a quick rebuild of the drive.

Occasionally, otherwise healthy drives [4] will appear to be offline while they complete a sector remap or
other similar operation. Although this is a normal operation, it can take several minutes. During this small
window, the drive is online but unresponsive. Once the internal operation is complete, the drive resumes
normal operation and responsiveness.

[4] This behavior is seen more often with SATA-II drives than FC drives.
Without rebuild logging, all I/O to the unavailable drive fails and FLARE quickly marks the drive as
“failed” and in need of a rebuild. Regardless of whether this unavailability is permanent or temporary,
manual intervention is required to replace the drive. Once the drive is replaced, a full rebuild is performed.
A full rebuild can be quite time-consuming for large drives. Furthermore, during the rebuild, the RAID
group may not have data redundancy.

Rebuild logging greatly reduces rebuild times and eliminates the need for manual intervention on
temporarily unavailable drives. When I/O to a drive fails due to a time-out error, the drive is considered for
probational status. Probational drives will delay a hot spare from swapping in. If the probational drive is in
a redundant RAID group, a rebuild log is created. This log tracks the sectors of the drive that should have
changed while the drive was unavailable. The accessibility of the drive is tested every 30 seconds for
approximately five minutes. Drives that become accessible within this time are brought back online. The
rebuild log is then used to rebuild only those sectors that changed while the drive was offline. Drives that
do not become accessible within the time limit are marked as “failed” and require a full rebuild.
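The probation window can be modeled as a simple polling loop (illustrative only; the 30-second interval and five-minute limit come from the text above, while the function and return strings are invented):

```python
def probation_outcome(seconds_until_accessible,
                      poll_interval=30, time_limit=300):
    """Model the probation window: test accessibility every 30 seconds
    for roughly five minutes, then give up and do a full rebuild."""
    for elapsed in range(poll_interval, time_limit + 1, poll_interval):
        if seconds_until_accessible <= elapsed:
            return "partial rebuild from log"  # only changed sectors
    return "full rebuild to hot spare"         # drive marked "failed"
```

A drive that finishes its internal operation in, say, 90 seconds is caught by an early poll and needs only the logged sectors rebuilt; a drive still unresponsive after the limit is failed outright.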

Some rebuild logging functionality can be extended to nonredundant RAID groups. A drive from a
nonredundant group may be considered for probational status. However, changes will not be tracked. If
the drive becomes available again within the time limit, it will be brought back online. Drives that do not
become accessible within the time limit are marked as “failed” and must be manually replaced [5].

Physically removing a drive will cancel the probational state of the drive and force a full rebuild to a hot
spare. Do not physically remove drives in a probational state.

Conclusion
When a RAID group operates in a degraded mode, it heightens the risk of data loss. Hot sparing, proactive
hot sparing, and rebuild logging work together to minimize this risk. Hot sparing rebuilds failed drives to
global hot spare disk drives, restoring high availability to the RAID group. Proactive hot sparing takes this
functionality one step further; by identifying failing drives before they fail, proactive hot sparing has the
potential to drastically reduce the time a RAID group stays in a degraded state. Rebuild logging allows a
drive in a redundant RAID group to be offline for a period of time while write I/O to the drive is logged.

Hot sparing, proactive hot sparing, and rebuild logging come standard with FLARE. EMC strongly
recommends that CLARiiON storage systems be configured with global hot spare disk drives to take full
advantage of the enhanced data integrity and high availability that hot sparing and proactive hot sparing
offer.

[5] It is recommended to protect RAID groups through either parity or mirroring.
Appendix: Frequently asked questions (FAQ)
Global hot sparing
Following are some frequently asked questions regarding CLARiiON global hot spares:
• Question: Can a specific hot spare disk be assigned to a specific RAID group?
Answer: No, CLARiiON hot spares are global. They will be assigned appropriately to a RAID group
if a drive should fail.
• Question: For a condition in which the hot spare disk specs do not match the RAID group disk specs,
what are the performance issues?

Answer: Using a hot spare with lower performance characteristics than the failed drive impacts the
RAID group’s performance more if the RAID group is disk bound. If the RAID group is not disk
bound, especially if I/Os are cached, using a hot spare with lower performance characteristics has a
minimal impact on the RAID group’s performance. In cases where performance is critical, designate
disks with the highest rotational speed as global hot spares. In any case, the impact of the rebuild is
more significant than the performance penalty for the equalization process.
• Question: How are hot spare disk drives selected when a disk failure occurs?

Answer: Hot spare disk drives are selected by type (FC/SATA-II, EFD, or ATA), size, and bus
number. The “Global hot spare drive selection algorithm” section has more information.
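The selection criteria above (type, size, and bus number) can be sketched as a filter-and-rank step. This is a hedged illustration, not FLARE's actual implementation: the exact tie-breaking order is covered in the "Global hot spare drive selection algorithm" section, and the same-bus and closest-fit preferences below are assumptions for the sake of the example.

```python
from dataclasses import dataclass

@dataclass
class Spare:
    drive_type: str   # "FC/SATA-II", "EFD", or "ATA"
    size_gb: int      # usable capacity
    bus: int          # back-end bus number

def select_hot_spare(spares, failed_type, failed_size_gb, failed_bus):
    """Pick an eligible spare for a failed drive.

    Eligibility: matching drive type and at least the failed drive's
    capacity. Ranking (assumed): prefer a spare on the same back-end
    bus, then the closest-fit capacity.
    """
    candidates = [s for s in spares
                  if s.drive_type == failed_type and s.size_gb >= failed_size_gb]
    if not candidates:
        return None   # no eligible spare; the RAID group stays degraded
    candidates.sort(key=lambda s: (s.bus != failed_bus, s.size_gb))
    return candidates[0]
```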

• Question: Are hot spares pooled into a common group?


Answer: Yes, CLARiiON hot spares are global. Any designated hot spare can replace any failed drive
in the storage system, provided it meets the criteria of the selection process.
• Question: How many hot spare drives can be allocated within a storage system?
Answer: Any drive that is not a vault drive can be allocated as a hot spare. For example, on a CX4-960
with 960 drives, the first five drives are vault drives and cannot be designated as global hot spares; a
RAID 5 group can be created over those five system drives, and the remaining 955 drives could all be
configured as global hot spares.
• Question: How many hot spare drives should be allocated within a storage system?

Answer: One hot spare drive should be allocated for every 30 drives of each type (FC/SATA-II, EFD,
or ATA). For example, in a system with 15 FC drives and 15 EFDs, two global hot spares should be
allocated, one FC and one EFD. In larger systems, hot spare drives should be evenly distributed across
back-end loops.
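The sizing guideline above (one spare per 30 drives of each type, rounded up) reduces to a short calculation. The function name is arbitrary; the 1-in-30 ratio comes directly from the recommendation above.

```python
import math

def recommended_hot_spares(drive_counts, ratio=30):
    """One global hot spare per `ratio` drives of each type, rounded up.

    `drive_counts` maps a drive type (e.g. "FC/SATA-II", "EFD", "ATA")
    to the number of drives of that type in the storage system.
    """
    return {t: math.ceil(n / ratio) for t, n in drive_counts.items() if n > 0}
```

For the worked example in the answer, a system with 15 FC drives and 15 EFDs yields one FC spare and one EFD spare.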
• Question: How do metaLUNs affect hot spare rebuilds?
Answer: MetaLUNs do not affect hot spare rebuilds. The rebuilds still occur at the individual LUN
(metaLUN’s component LUN) level.

Proactive hot sparing
• Question: Can a specific proactive hot spare disk be selected for a manually initiated proactive copy?
Answer: No, regardless of how the proactive hot spare is initiated, hot spare drives are selected
according to a selection algorithm. The “Global hot spare drive selection algorithm” section has more
information.
• Question: How is a proactive spare disk drive selected when a proactive copy is initiated?
Answer: Proactive spare disk drives are selected by type (FC/SATA-II, EFD, or ATA), size, and bus
number. The “Global hot spare drive selection algorithm” section has more information.
• Question: When a proactive copy is initiated, which LUNs are copied first?
Answer: Like the equalize operation, the proactive copy operation progresses sequentially through the
logical block addresses.
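The sequential progression through logical block addresses can be sketched as a chunked copy loop that advances a single checkpoint. The `Disk` class, chunk size, and function name are hypothetical; only the sequential-LBA behavior is taken from the answer above.

```python
class Disk:
    """Minimal stand-in for a drive, modeled as a list of blocks."""
    def __init__(self, blocks):
        self.blocks = list(blocks)

def proactive_copy(source, target, chunk=4):
    """Copy sequentially through the LBAs, like the equalize operation.

    Advances a single checkpoint from LBA 0 to the end of the drive and
    returns the final checkpoint (the first un-copied LBA).
    """
    lba = 0
    while lba < len(source.blocks):
        n = min(chunk, len(source.blocks) - lba)
        target.blocks[lba:lba + n] = source.blocks[lba:lba + n]
        lba += n
    return lba
```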

• Question: What happens when the proactive spare drive fails during proactive copy?

Answer: If the proactive spare fails during a proactive copy operation, the proactive spare is swapped
out, and a new proactive hot spare request is required to initiate another proactive copy operation. A
message is logged to the event log alerting the user that proactive hot sparing failed.

• Question: What happens when the proactive candidate drive fails during proactive copy?

Answer: When a proactive candidate fails during the proactive copy operation, the proactive hot
sparing process transitions into hot sparing. Upon the failure of the proactive candidate, all units are
marked as “needing a rebuild,” and the checkpoints are updated for each partition that needs a rebuild.
The checkpoints will show how far the rebuild got, so that the entire hot spare does not have to be
rebuilt.
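The checkpoint behavior described above, where only the un-copied region needs a rebuild, can be sketched as follows. The partition/extent representation and function name are hypothetical; the idea that checkpoints let the rebuild skip already-copied data is taken from the answer.

```python
def transition_to_rebuild(partitions, copy_checkpoint):
    """When the proactive candidate fails mid-copy, plan the rebuild.

    `partitions` is a list of (start_lba, end_lba) extents on the spare;
    `copy_checkpoint` is the first LBA the proactive copy did not reach.
    Returns the extents that still need a rebuild, so the entire hot
    spare does not have to be rebuilt.
    """
    plan = []
    for start, end in partitions:
        if end <= copy_checkpoint:
            continue                                     # fully copied; skip
        plan.append((max(start, copy_checkpoint), end))  # rebuild the remainder
    return plan
```

For example, if the copy reached LBA 150 across extents (0, 100), (100, 200), and (200, 300), only the region from 150 onward is rebuilt.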

• Question: What happens when another drive fails in the same RAID group as the proactive candidate
during the proactive copy?

Answer: When a drive fails in a RAID group that is proactively copying, the LUN enters a degraded
state. The proactive copy to the proactive spare will continue to completion. If faulting the proactive
candidate will cause the RAID group to shut down, then the proactive candidate is left alive and the
proactive copy percentage remains 100 percent. Hot sparing is activated on the failed drive. Once the
fault in the RAID group is fixed and all data rebuilt, the proactive candidate will be faulted.
