Parity Declustering for Continuous Operation in Redundant Disk Arrays
Mark Holland
Department of Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213-3890
email@example.com

Garth A. Gibson
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213-3890
firstname.lastname@example.org
Abstract

We describe and evaluate a strategy for declustering the parity encoding in a redundant disk array. This declustered parity organization balances cost against data reliability and performance during failure recovery. It is targeted at highly-available parity-based arrays for use in continuous-operation systems. It improves on standard parity organizations by reducing the additional load on surviving disks during the reconstruction of a failed disk's contents. This yields higher user throughput during recovery, and/or shorter recovery time. We first address the generalized parity layout problem, basing our solution on balanced incomplete and complete block designs. A software implementation of declustering is then evaluated using a disk array simulator under a highly concurrent workload composed of small user accesses. We show that declustered parity penalizes user response time while a disk is being repaired (before and during its recovery) less than comparable non-declustered (RAID5) organizations, without any penalty to user response time in the fault-free state. We then show that previously proposed modifications to a simple, single-sweep reconstruction algorithm further decrease user response times during recovery, but, contrary to previous suggestions, the inclusion of these modifications may, for many configurations, also slow the reconstruction process. This result arises from the simple model of disk access performance used in previous work, which did not consider throughput variations due to positioning delays.
1. Introduction

Many applications, notably database and transaction processing, require both high throughput and high data availability from their storage subsystems. The most demanding of these applications require continuous operation, which in terms of a storage subsystem requires (1) the ability to satisfy all user requests for data even in the presence of a disk failure, and (2) the ability to reconstruct the contents of a failed disk onto a replacement disk, thereby restoring itself to a fault-free state. It is not enough to fulfill these two requirements with arbitrarily degraded performance; it is not unusual for an organization that requires continuously-available data to incur financial losses substantially larger than its total investment in computing equipment if service is severely degraded for a prolonged period of time. Since the time necessary to reconstruct the contents of a failed disk is certainly minutes and possibly hours, we focus this paper on the performance of a continuous-operation storage subsystem during on-line failure recovery. We do not recommend on-line failure recovery in an environment that can tolerate off-line recovery, because the latter restores high performance and high data reliability more quickly.

Redundant disk arrays, proposed for increasing input/output performance and for reducing the cost of high data reliability [Kim86, Livny87, Patterson88, Salem86], also offer an opportunity to achieve high data availability without sacrificing throughput goals. A single-failure-correcting redundant disk array consists of a set of disks, a mapping of user data to these disks that yields high throughput [Chen90b], and a mapping of a parity encoding for the array's data such that data lost when a disk fails can be recovered without taking the system off-line [Lee91]. Most single-failure-correcting disk arrays employ either mirrored or parity-encoded redundancy. In mirroring [Bitton88, Copeland89, Hsiao90], one or more duplicate copies of all data are stored on separate disks. In parity encoding [Kim86, Patterson88, Gibson91], popularized as Redundant Arrays of Inexpensive Disks (RAID), some subset of the physical blocks in the array is used to store a single-error-correction code (usually parity) computed over subsets of the data. Mirrored systems, while potentially able to deliver higher throughput than parity-based systems for some workloads [Chen90a, Gray90], increase cost by consuming much more disk capacity for redundancy. In this paper we examine a parity-based redundancy scheme called parity declustering, which provides better performance during on-line failure recovery than more common RAID schemes, without the high capacity overhead of mirroring [Muntz90].
This work was supported by the National Science Foundation under grant number ECD-8907068, by the Defense Advanced Research Projects Agency monitored by DARPA/CMO under contract MDA972-90-C-0035, and by an IBM Graduate Fellowship.
Section 2 of this paper describes our terminology and presents the declustered parity organization. Section 3 describes related studies, notably the introduction of declustering by Muntz and Lui [Muntz90], and explains the motivations behind our investigation. Section 4 presents our parity mapping, which addresses the generalized parity layout problem that was left as an open problem by Muntz and Lui. Section 5 gives a brief overview of our simulation environment, and Sections 6 and 7 present brief analyses of the performance of a declustered array when it is fault-free and when there is a failed disk but no replacement. Section 8 then covers reconstruction performance, contrasting single-thread and parallel reconstruction and evaluating alternative reconstruction algorithms. Section 9 concludes the paper with a look at interesting topics for future work.

Our primary figures of merit in this paper are reconstruction time, which is the wall-clock time taken to reconstruct the contents of a failed disk after replacement, and user response time during reconstruction. Reconstruction time is important because it determines the length of time that the system operates at degraded performance, and because it is a significant contributor to the length of time that the system is vulnerable to data loss caused by a second failure. Given a fixed user throughput, contrasting user fault-free response time to the response time after failure (both before and during reconstruction) gives us the measure of our system's performance degradation during failure recovery.

2. The declustered parity layout policy

Figure 2-1 illustrates the parity and data layout for a left-symmetric RAID5 redundant disk array [Lee91], which defines RAID5 mappings in the context of this paper.

Figure 2-1: Parity and data layout in a left-symmetric RAID5 organization (five disks; data units Di.j and parity units Pi at unit offsets 0 through 4 on each disk).

A data stripe unit, or simply a data unit, is defined as the minimum amount of contiguous user data allocated to one disk before any data is allocated to any other disk. A parity stripe unit, or simply a parity unit, is a block of parity information that is the size of a data stripe unit. The size of a stripe unit, called a unit for convenience, must be an integral number of sectors, and is often the minimum unit of update used by system software. We use the term stripe unit instead of data unit or parity unit when the distinction between data and parity is not pertinent to the point being made. A parity stripe is the set of data units over which a parity unit is computed, plus the parity unit itself. (Muntz and Lui [Muntz90] use the term group to denote what we call a parity stripe, but we avoid this usage as it conflicts with the Patterson, et al. definition [Patterson88] as a set of disks, rather than a set of disk blocks.) In Figure 2-1, D1.j represents one of the four data units in parity stripe number 1, and Pi represents the parity unit for parity stripe i; P0, for example, is the cumulative parity (exclusive-or) of data units D0.0 through D0.3. Parity units are distributed across the disks of the array to avoid the write bottleneck that would occur if a single disk contained all parity units. When a disk is identified as failed, any data unit on it can be reconstructed by reading the corresponding units in the parity stripe, including the parity unit, and computing the cumulative exclusive-or of this data.

The disk array's data layout provides the abstraction of a linear ("logical block") address space to the file system. In addition to mapping data units to parity stripes, a left-symmetric RAID5 organization also specifies the data layout: data is mapped to stripe units Di.j according to ascending j within ascending i. In Figure 2-1, this means that user data is logically D0.0, D0.1, D0.2, D0.3, D1.0, etc. This is a good choice because any set of five contiguous data units map to five different disks, maximizing read parallelism. (A file system may or may not allocate user blocks contiguously in this address space, so a disk array data mapping cannot alone insure this property of maximal parallelism.) Note, however, that all the disks in the array are needed by every access that requires reconstruction; we shall see that declustered parity relaxes this constraint.
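As a concrete illustration of the left-symmetric mapping just described, the C sketch below maps a parity stripe number and stripe-unit index to a (disk, offset) pair. This is our own illustration using one common formulation of "left-symmetric", not code from the paper or from raidSim; the function and variable names are ours. It should reproduce the placement property of Figure 2-1: any five consecutive data units fall on five different disks.

```c
#include <stdio.h>

#define C 5                  /* disks in the array; G = C for RAID5 */

/* Map stripe i, stripe-unit j (j == C-1 denotes the parity unit) to a disk
 * and a per-disk unit offset under a left-symmetric RAID5 layout.          */
static void left_symmetric(int i, int j, int *disk, int *offset)
{
    int parity_disk = (C - 1) - (i % C);      /* parity rotates right to left */
    *offset = i;                              /* one parity stripe per row    */
    if (j == C - 1)
        *disk = parity_disk;                  /* the parity unit              */
    else
        *disk = (parity_disk + 1 + j) % C;    /* data unit j of stripe i      */
}

int main(void)
{
    for (int i = 0; i < 5; i++)               /* print the first five stripes */
        for (int j = 0; j < C; j++) {
            int d, off;
            left_symmetric(i, j, &d, &off);
            printf("stripe %d unit %d -> disk %d, offset %d\n", i, j, d, off);
        }
    return 0;
}
```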
Following Muntz and Lui, let G be the number of units in a parity stripe, including the parity unit, and consider the problem of decoupling G from the number of disks in the array, C. (Muntz and Lui use the term clustered where we use the term declustered. Our use follows the earlier work of Copeland and Keller [Copeland89], where redundancy information is "declustered" over more than the minimal collection of disks; their use may be derived from "clustering" independent RAIDs into a single array with the same parity overhead.) For our purposes, a RAID5 array is one in which parity is computed over the entire width of the array, that is, G = C; the RAID5 example in Figure 2-1 has G = C = 5. Declustering parity reduces to a problem of finding a parity mapping that will allow parity stripes of size G units to be distributed over some larger number of disks, C, where this larger set of C disks is the whole array. One perspective on the concept of parity declustering in redundant disk arrays is demonstrated in Figure 2-2: a logical RAID5 array with G = 4 is distributed over C = 7 > G disks.

Figure 2-2: One possible declustering of a parity stripe of size four over an array of seven disks (a logical array of G = 4 disks mapped onto a physical array of C = 7 disks; 'S' marks the units of one parity stripe).

The advantage of this approach is that it reduces the reconstruction workload applied to each disk during failure recovery. (The way this mapping is generated is the topic of Section 4.) To see this, note that for any given stripe unit on a failed (physical) disk, the parity stripe to which it belongs includes units on only a subset of the total number of disks in the array; in Figure 2-2, if, for example, disk zero fails, disks 2, 3, and 6 do not participate in the reconstruction of the parity stripe marked 'S'. In contrast, a RAID5 array has C = G, and so all disks participate in the reconstruction of all units of the failed disk, while under declustering a smaller fraction of each surviving disk is read during reconstruction.

Figure 2-3 presents a declustered parity layout for G = 4 and C = 5 that is described in Section 4. What is important at this point is that fifteen data units are mapped onto five parity stripes in the array's first twenty disk units, while in the RAID5 organization of Figure 2-1, sixteen data units are mapped onto four parity stripes in the same number of disk units. More disk units are consumed by parity, but not every parity stripe is represented on each disk; for example, in Figure 2-3, if disk zero fails, parity stripe four will not have to be read during reconstruction. Note that the successive stripe units in a parity stripe occur at varying disk offsets: Section 6 shows that this has no significant performance impact. Although Figure 2-3 does not spread parity units evenly over all disks, the constructions we present in Section 4 all possess this property.

Figure 2-3: Example data layout in a declustered parity organization (G = 4, C = 5; data units Di.j and parity units Pi at offsets 0 through 3 on each of five disks).

3. Related work

The idea of improving failure-mode performance by declustering redundancy information originated with mirrored systems [Copeland89, Hsiao90]. Traditionally, mirrored systems allocate one disk as a primary and another as a secondary. Copeland and Keller describe a scheme called interleaved declustering, which treats primary and secondary data copies differently: they instead allocate only half of each disk for primary copies, and the other half of each disk contains a portion of the secondary copy data from each of the primaries on all other disks. This insures that a failure can be recovered, since the primary and secondary copies of any data are on different disks, and it also distributes the workload associated with reconstructing a failed disk across all surviving disks in the array. Hsiao and DeWitt propose a variant called chained declustering that increases the array's data reliability, because individual disks are called on less often in the reconstruction of one of the other disks.

The parameters C and G and the ratio between them together determine the reconstruction performance, the data reliability, and the cost-effectiveness of the array. C determines the cost of the array because it specifies the number of disks. It also determines the array's data reliability, because our mappings require some data from all surviving disks to reconstruct a failed disk; hence any two failures in the C disks constituting the array will cause data loss. The parameter G determines the percentage of total disk space consumed by parity, 1/G. Finally, the declustering ratio determines the reconstruction performance of the system: Muntz and Lui define the ratio (G-1)/(C-1) as α, which we call the declustering ratio, and it indicates the fraction of each surviving disk that must be read during the reconstruction of a failed disk. Note that α = 1 for the RAID5 organization, indicating that all of every surviving disk participates in a reconstruction. In general, a smaller value of α should yield better reconstruction performance, since a failed disk's reconstruction workload is spread over a larger number of disks. All of our performance graphs in Sections 6, 7, and 8 are parameterized by α. System administrators need to be able to specify C and G at installation time according to their cost, capacity, performance, and data reliability needs; this paper provides analyses upon which these decisions can be based.
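To make the roles of C, G, and α concrete, the small helper below (our illustration, not part of raidSim) reports the declustering ratio, the parity overhead, and the fraction of each surviving disk read during reconstruction for a candidate configuration.

```c
#include <stdio.h>

/* For an array of C disks with parity stripes of G units each (G <= C),
 * report the quantities discussed in the text.                           */
static void report(int C, int G)
{
    double alpha    = (double)(G - 1) / (C - 1);  /* declustering ratio         */
    double overhead = 1.0 / G;                    /* fraction of space = parity */
    printf("C=%2d G=%2d  alpha=%.2f  parity overhead=%.1f%%  "
           "fraction of each survivor read on rebuild=%.2f\n",
           C, G, alpha, overhead * 100.0, alpha);
}

int main(void)
{
    report(5, 5);    /* RAID5: alpha = 1, every survivor is read in full      */
    report(21, 4);   /* declustered: alpha = 0.15, as in the simulated arrays */
    return 0;
}
```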
Muntz and Lui applied ideas similar to those of Copeland and Keller to parity-based arrays. They proposed the declustered parity organization described in Section 2, and then modeled it analytically, making a number of simplifying assumptions. We attempt in this paper to identify the limits of this theoretical analysis and provide performance predictions based instead on a software implementation and array simulation. Toward this end we have two primary concerns with the Muntz and Lui analysis.

First, their study assumes that either the set of surviving disks or the replacement disk is driven at 100% utilization. Unfortunately, driving a queueing system such as a magnetic disk at full utilization leads to arbitrarily long response times. Response time is important to all customers, and critical in database and on-line transaction-processing (OLTP) systems. The specifications for OLTP systems often require some minimal level of responsiveness; for example, one benchmark requires that 90% of the transactions must complete in under two seconds [TPCA89]. In a continuous-operation system that requires minutes to hours for the recovery of a failed disk, this rule will apply even during these relatively rare recovery intervals. Our analysis reports on user response time during recovery and presents a simple scheme trading off reconstruction time for user response time.

Our second concern with the Muntz and Lui analysis is that their modeling technique assumes that all disk accesses have the same service time distribution, which is not the case. Real disk accesses are subject to positioning delays that are dependent on the current head position and the position of the target data. As an example, suppose that a given track on a replacement disk is being reconstructed, and that a few widely scattered stripe units on that track are already valid because they were written as a result of user (not reconstruction) accesses during reconstruction. These units may be skipped over by reconstruction, or they may simply be reconstructed along with the rest of the track and over-written with the data that they already hold. Whichever option is selected, the previously-reconstructed sectors will still have to rotate under the disk heads, and hence the time needed to reconstruct the track will not decrease. The Muntz and Lui model, however, assumes that the track reconstruction time is reduced by a factor equal to the size of the units not needing reconstruction divided by the size of the track. This idea that disk drives are not "work-preserving" due to head positioning and spindle rotation delays is an effect that is difficult to model analytically, but relatively straightforward to address in a simulation-based study.

Reddy [Reddy91] has also used block designs to improve the recovery-mode performance of an array. His approach generates a layout with properties similar to ours, but is restricted to the case where G = C/2. In this paper we will use balanced incomplete and complete block designs (described in Section 4.2) to achieve better performance during reconstruction.

4. Data layout strategy

In this section we describe our layout goals and the technique used to achieve them.

4.1. Layout goals

Previous work on declustered parity has left open the problem of allocating parity stripes in an array. Extending from non-declustered parity layout research [Lee90, Dibble90], we have identified six criteria for a good parity layout. The first four of these deal exclusively with relationships between stripe units and parity stripe membership, while the last two make recommendations for the relationship between user data allocation and parity stripe organization.

1. Single failure correcting. No two stripe units in the same parity stripe may reside on the same physical disk. This is the basic characteristic of any redundancy organization that recovers the data of failed disks. In arrays in which groups of disks have a common failure mode, such as power or data cabling, this criterion should be extended to prohibit the allocation of stripe units from one parity stripe to two or more disks sharing that common failure mode [Schulze89, Gibson93].

2. Distributed reconstruction. When any disk fails, its user workload should be evenly distributed across all other disks in the array. When it is replaced or repaired, its reconstruction workload should also be evenly distributed.

3. Distributed parity. Parity information should be evenly distributed across the array. Every data update causes a parity update, and so an uneven parity distribution would lead to imbalanced utilization (hot spots), since the disks with more parity would experience more load.

4. Efficient mapping. The functions mapping a file system's logical block address to physical disk addresses for the corresponding data units and parity stripes, and the appropriate inverse mappings, must be efficiently implementable; they should consume neither excessive computation nor memory resources.

5. Large write optimization. The allocation of contiguous user data to data units should correspond to the allocation of data units to parity stripes. This insures that whenever a user performs a write that is the size of the data portion of a parity stripe and starts on a parity stripe boundary, it is possible to execute the write without pre-reading the prior contents of any disk data, since the new parity unit depends only on the new data.

6. Maximal parallelism. A read of contiguous user data with size equal to a data unit times the number of disks in the array should induce a single data unit read on all disks in the array (while requiring alignment only to a data unit boundary). This insures that maximum parallelism can be obtained.

Because file systems are free to and often do allocate user data arbitrarily into whatever logical space a storage subsystem presents, our parity layout procedures have no direct control over these latter two criteria. As shown in Figure 2-1, the left-symmetric mapping for RAID5 arrays meets all of these criteria.
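The first three criteria can be checked mechanically for any candidate layout. The sketch below is our own illustrative checker, not code from the paper; the layout representation (one disk number per stripe unit, plus the position of the parity unit in each stripe) and all names are assumptions made for the example. The pair-count test for criterion 2 assumes the layout covers whole full block design tables.

```c
#include <stdio.h>
#include <stdlib.h>

/* stripe_disk[s*G + u] is the disk holding unit u of parity stripe s;
 * parity_pos[s] says which unit of stripe s is the parity unit.          */
static int check_layout(int nstripes, int G, int C,
                        const int *stripe_disk, const int *parity_pos)
{
    int *pairs  = calloc((size_t)C * C, sizeof *pairs);  /* co-occurrence counts  */
    int *parity = calloc((size_t)C, sizeof *parity);     /* parity units per disk */
    int ok = 1;

    for (int s = 0; s < nstripes; s++) {
        for (int u = 0; u < G; u++)
            for (int v = u + 1; v < G; v++) {
                int a = stripe_disk[s*G + u], b = stripe_disk[s*G + v];
                if (a == b) ok = 0;        /* criterion 1: single failure correcting */
                pairs[a*C + b]++;
                pairs[b*C + a]++;
            }
        parity[stripe_disk[s*G + parity_pos[s]]]++;
    }

    /* Criterion 2 (distributed reconstruction): every pair of disks should
     * co-occur in the same number of parity stripes.                        */
    for (int a = 0; a < C; a++)
        for (int b = 0; b < C; b++)
            if (a != b && pairs[a*C + b] != pairs[0*C + 1]) ok = 0;

    /* Criterion 3 (distributed parity): every disk holds the same number of
     * parity units.                                                          */
    for (int d = 1; d < C; d++)
        if (parity[d] != parity[0]) ok = 0;

    free(pairs);
    free(parity);
    return ok;
}
```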
4.2. Layout strategy

Our declustered parity layouts are specifically designed to meet our criterion for distributed reconstruction while also lowering the amount of reconstruction work done by each surviving disk. The distributed reconstruction criterion requires that the same number of units be read from each surviving disk during the reconstruction of a failed disk. This will be achieved if the number of times that a pair of disks contain stripe units from the same parity stripe is constant across all pairs of disks, and this is guaranteed by the definition of a block design. Muntz and Lui recognized and suggested that such layouts might be found in the literature for balanced incomplete block designs; this paper demonstrates that this can be done, and one way to do it.

A block design is an arrangement of v distinct objects into b tuples (these tuples are called blocks in the block design literature, but we avoid this name as it conflicts with the commonly held definition of a block as a contiguous chunk of data), each containing k elements, such that each object appears in exactly r tuples and each pair of objects appears in exactly λp tuples. For example, a block design with b = 5, v = 5, k = 4, r = 4, and λp = 3 is given in Figure 4-1. It is useful to note that only three of v, k, b, r, and λp are free variables, because the following two relations are always true: bk = vr, and r(k-1) = λp(v-1). The first of these relations counts the objects in the block design in two ways, and the second counts the pairs in two ways. This example demonstrates a simple form of block design called a complete block design, which includes all combinations of exactly k distinct elements selected from the set of v objects; the number of these combinations is (v choose k).

Figure 4-1: A sample complete block design. Tuple 0: 0, 1, 2, 3; Tuple 1: 0, 1, 2, 4; Tuple 2: 0, 1, 3, 4; Tuple 3: 0, 2, 3, 4; Tuple 4: 1, 2, 3, 4.

The layout we use associates disks with objects and parity stripes with tuples. To build a parity layout, we find a block design with v = C, k = G, and the minimum possible value for b (as explained in Section 4.3). Our mapping identifies the elements of a tuple in a block design with the disk numbers on which each successive stripe unit of a parity stripe is allocated: using non-negative integers as objects, stripe unit j of parity stripe i is assigned to the lowest available offset on the disk identified by the jth element of tuple i mod b in the block design. For clarity, the following discussion is illustrated by the construction of the layout in Figure 4-2 from the block design in Figure 4-1. The first tuple in the design of Figure 4-1 is used to lay out parity stripe 0: the three data blocks in parity stripe 0 are on disks 0, 1, and 2, and the parity block is on disk 3. Based on the second tuple, stripe 1 is on disks 0, 1, and 2, with parity on disk 4, and so on. We refer to one iteration of this layout (the first four blocks on each disk in Figure 4-2) as the block design table.

It is apparent from Figure 2-3 that this approach alone produces a layout that violates our distributed parity criterion (3). To resolve this violation, we derive a layout as above and duplicate it G times (four times in Figure 4-2), assigning parity to a different element of each tuple in each duplication. Since each object appears in exactly r tuples in a block design table, this results in the assignment of parity to each disk exactly r times over the course of a full block design table; that is, each disk will be assigned parity in exactly r tuples in the full block design table. This layout, the entire contents of Figure 4-2, is further duplicated until all stripe units on each disk are mapped to parity stripes; one complete cycle (all blocks in Figure 4-2) is called the full block design table.

Figure 4-2: Full block design table for a parity declustering organization (C = 5, G = 4): parity stripes 0 through 19 are laid out at offsets 0 through 15 on disks 0 through 4, derived from the tuples of Figure 4-1, with parity assigned to a different tuple element in each of the four copies of the block design table.

We now show how well this layout procedure meets the first four of our layout criteria. No pair of stripe units in the same parity stripe will be assigned to the same disk, because the G elements of each tuple are distinct; this insures our single failure correcting criterion. The second criterion, which is that reconstruction be distributed evenly, is guaranteed because each pair of objects appears in exactly λp tuples: disk i occurs in exactly λp parity stripes with each other disk, so when disk i fails, every other disk reads exactly λp stripe units while reconstructing the stripe units associated with each block design table. Note that the actual value of λp is not significant; it is only necessary that it be constant across all pairs of disks. Our distributed parity criterion is achieved because a full block design table is composed of G block design tables, each assigning parity to a different element of each tuple. To see this, refer to Figure 4-2: if we group together the vertical boxes in the right half of the figure we see that we have reconstituted one block design table, and the parity assignment function touches each element of the block design table exactly once over the course of one full block design table. Unfortunately, it is not guaranteed that our layout will have an efficient mapping, our fourth criterion, because the size of a block design table is not guaranteed to be small; Section 4.3 demonstrates that small block design tables are available for a wide range of parity stripe and array sizes.

Finally, our fifth and sixth criteria depend on the data mapping function used by higher levels of software. The simple mapping of data to successive data units within successive parity stripes (Figure 4-2), which we use in our simulations, meets our large-write optimization criterion but does not meet our maximal parallelism criterion: not all sets of five adjacent data units from the mapping are allocated on five different disks. In Figure 4-2, reading five adjacent data units starting at data unit zero causes disks 0 and 1 to be used twice, and disks 3 and 4 not at all. On the other hand, if we were to employ a data mapping similar to Lee's left-symmetric parity (for non-declustered RAID5 arrays), maximizing read parallelism, we may fail to satisfy our large-write optimization criterion. We leave for future work the development of a declustered parity scheme that satisfies both of these criteria.
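The construction described above is mechanical enough to sketch directly. The C program below is our illustration (not code from raidSim) of laying out one full block design table from the complete block design of Figure 4-1; the choice of which tuple element holds parity in each copy of the table is one valid systematic rotation, and the array and variable names are ours.

```c
#include <stdio.h>

#define C 5                  /* disks: the v objects of the design         */
#define G 4                  /* units per parity stripe: tuple size k      */
#define B 5                  /* tuples b in one block design table         */
#define UNITS_PER_DISK 16    /* r*G = 16 units per disk in one full table  */

/* The complete block design of Figure 4-1: every 4-element subset of
 * {0,...,4}. Tuple s lists the disks holding the units of parity stripe
 * s mod B.                                                                */
static const int tuple[B][G] = {
    {0,1,2,3}, {0,1,2,4}, {0,1,3,4}, {0,2,3,4}, {1,2,3,4},
};

int main(void)
{
    int next_free[C] = {0};                    /* lowest available offset per disk */
    static int stripe_of[C][UNITS_PER_DISK];   /* which parity stripe is here      */
    static int unit_of[C][UNITS_PER_DISK];     /* which element of its tuple       */

    /* One full block design table = G copies of the base table; each copy
     * assigns parity to a different tuple element (here element G-1-copy). */
    for (int s = 0; s < G * B; s++)
        for (int j = 0; j < G; j++) {
            int disk = tuple[s % B][j];
            int off  = next_free[disk]++;      /* "lowest available offset"        */
            stripe_of[disk][off] = s;
            unit_of[disk][off]   = j;
        }

    /* Print the layout in the style of Figure 4-2, marking parity units.   */
    for (int off = 0; off < UNITS_PER_DISK; off++) {
        for (int d = 0; d < C; d++) {
            int s    = stripe_of[d][off];
            int j    = unit_of[d][off];
            int ppos = (G - 1) - (s / B);      /* parity position in this copy     */
            if (j == ppos) printf("  P%-4d ", s);
            else           printf("  D%d.%-2d", s, j < ppos ? j : j - 1);
        }
        printf("\n");
    }
    return 0;
}
```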
4.3. On the generation of block designs

Our goal, then, is to find a small block design on C objects with a tuple size of G; this is a difficult problem for general C and G. Complete block designs such as the one in Figure 4-1 are easily generated, so whenever possible we attempt to use a complete block design on the indicated parameters. However, when the number of disks in an array (C) is large relative to the number of stripe units in a parity stripe (G), the size of the block design table becomes unacceptably large and the layout fails our efficient mapping criterion. For example, a 41 disk array with 20% parity overhead (G = 5) allocated by a complete block design will have about 3,750,000 tuples in its full block design table. In addition to the exorbitant memory requirement for this table, the layout will not meet our distributed parity or distributed reconstruction criteria, because even large disks rarely have more than a few million sectors.

When this method produces an unacceptably large design, we turn to the theory of balanced incomplete block designs [Hall86]. Hall presents a number of techniques for constructing such designs, but in many cases they are insufficient for our purposes; these techniques are of more theoretical interest than practical value, since they do not provide sufficiently general techniques for the direct construction of the necessary designs. Fortunately, Hall also presents a list containing a large number of known block designs, and states that, within the bounds of this list, a solution is given in every case where one is known to exist. Whenever possible, our parity declustering implementation uses one of Hall's designs. Figure 4-3 presents a scatter plot of a subset of Hall's list of designs. (These block designs, and others, are available via anonymous ftp from niagara.nectar.cs.cmu.edu (128.2.250.200) in the file /usr0/anon/pub/Declustering/BD_database.tar.Z.)

Figure 4-3: Known block designs (array size C, 0 to 100, versus declustering ratio α, 0.0 to 1.0, for a subset of Hall's list).

Sometimes a balanced incomplete block design with the required parameters may not be known. In cases where we cannot find a balanced incomplete block design, we resort to choosing the closest feasible design point, that is, the point which yields a value of α closest to what is desired. All of our results indicate that the performance of an array is not highly sensitive to such small variations in α. The block designs we used in our simulations are given in [Holland92].
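Complete block designs are simply all k-element subsets of the v disks, so generating one (and checking whether its size, v choose k, is small enough to use directly) is straightforward. The recursive generator below is our own sketch, not code from the paper.

```c
#include <stdio.h>

/* Emit all k-element subsets of {0,...,v-1}; each subset is one tuple of a
 * complete block design.                                                   */
static void gen(int v, int k, int start, int depth, int *tuple)
{
    if (depth == k) {
        for (int i = 0; i < k; i++) printf("%d ", tuple[i]);
        printf("\n");
        return;
    }
    for (int d = start; d <= v - (k - depth); d++) {
        tuple[depth] = d;
        gen(v, k, d + 1, depth + 1, tuple);
    }
}

int main(void)
{
    int tuple[8];
    gen(5, 4, 0, 0, tuple);   /* reproduces the five tuples of Figure 4-1 */
    /* For C = 41 and G = 5 there are C(41,5) = 749,398 such tuples, and the
     * full table (G copies) has about 3.75 million, which is why Section 4.3
     * turns to balanced incomplete block designs instead.                  */
    return 0;
}
```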
5. The simulation environment

We acquired an event-driven disk-array simulator called raidSim [Chen90b, Lee91] for our analyses. The simulator was developed for the RAID project at U.C. Berkeley [Katz89]. It consists of four primary components, which are illustrated in Figure 5-1 and described below.

Figure 5-1: The structure of raidSim (synthetic reference generator, RAID striping driver, disk simulation module, and event-driven simulator).

At the top level of abstraction is a synthetic reference generator, which is capable of producing user request streams drawn from a variety of distributions. Table 5-1 (a) shows the configuration of the workload generator used in our simulations; we have restricted our attention to random accesses of size 4 KB to model an OLTP system with an effective buffer cache [Ramakrishnan92]. Each request produced by this generator is sent to a RAID striping driver, which was originally the actual code used by the Sprite operating system [Ousterhout88] to implement a RAID device on a set of independent disks. The striping driver code was taken directly from the Sprite source code with essentially zero modification to accommodate simulation, and all our modifications conform to Sprite constraints. This assures that reference streams generated by this driver are identical to those that would be observed in an actual disk array running the same synthetic workload generator. These upper two levels of raidSim should actually run on a Sprite machine, since we have extended the code in such a way that it could be re-incorporated into an operating system at any time. This approach also forces us to actually implement our layout strategy and reconstruction optimizations, which minimizes the possibility that un-considered implementation details will lead to erroneous conclusions. All reconstruction algorithms discussed in Section 8 have been fully implemented and tested under simulation in our version of the RAID striping driver; Table 5-1 (b) shows the configuration of our extended version of this striping driver. Low-level disk operations generated by the striping driver are sent to a disk simulation module, which accurately models significant aspects of each specific disk access (seek time, rotation time, cylinder layout, etc.). Table 5-1 (c) shows the characteristics of the 314 MB, 3 1/2 inch diameter IBM 0661 Model 370 (Lightning) disks on which our simulations are based [IBM0661]. At the lowest level of abstraction in raidSim is an event-driven simulator, which is invoked to cause simulated time to pass.

Table 5-1: Simulation parameters.
(a) Workload parameters. Access size: fixed at 4 KB. User access rate: 105, 210, and 378 accesses/second. Alignment: fixed at 4 KB. Distribution: uniform over all data. Write ratio: 0% and 100% (Sections 6 and 7), 50% (Section 8).
(b) Array parameters. Stripe unit: fixed at 4 KB. Number of disks: fixed at 21 spindle-synchronized disks. Head scheduling: CVSCAN [Geist87]. Parity stripe size: varied (G = 20α + 1). Parity overhead: 1/G. Data layout: RAID5: left-symmetric; declustered: by parity stripe index. Parity layout: RAID5: left-symmetric; declustered: block design based. Power/cabling: disks independently powered and cabled.
(c) Disk parameters. Geometry: 949 cylinders, 14 heads, 48 sectors/track. Sector size: 512 bytes. Revolution time: 13.9 ms. Seek time model: 2.0 + 0.01·dist + 0.46·sqrt(dist) ms, 2 ms minimum, 25 ms maximum. Track skew: 4 sectors.

It is evident that our simulator models a software implementation of a striped disk array, but we expect a hardware implementation to deliver similar performance. This is because (1) we do not model the time taken to compute the XOR functions in the CPU, (2) our simulator does not incur rotation slips due to interrupt overhead and command processing time in the CPU, and (3) the code path length for the mapping functions is not significantly longer in our declustered layouts than in the RAID5 case.

In all performance plots, each data point is an average over five independently-seeded runs of the simulator. We find that the 95% confidence intervals are too small to include on any graph; the maximum width of a confidence interval over all data points on all plots is 6.7% of the value of the data point.
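raidSim's generator is not reproduced here, but the sketch below illustrates the kind of synthetic stream Table 5-1 (a) describes: fixed 4 KB accesses, addresses uniform over the data space, a configurable write fraction, and a given mean arrival rate. The exponential interarrival assumption and all names are ours; the paper only fixes the mean rates.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

struct request { double time; long unit; int is_write; };

/* Produce the next 4 KB request after 'now' at mean rate 'rate' per second,
 * uniformly distributed over 'data_units' units, with the given write
 * fraction.                                                                 */
static struct request next_request(double now, double rate,
                                   long data_units, double write_frac)
{
    struct request r;
    double u   = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    r.time     = now - log(u) / rate;            /* exponential gap, mean 1/rate */
    r.unit     = rand() % data_units;            /* uniform over all data        */
    r.is_write = ((double)rand() / RAND_MAX) < write_frac;
    return r;
}

int main(void)
{
    double now = 0.0;
    for (int i = 0; i < 5; i++) {
        struct request r = next_request(now, 210.0, 1000000L, 0.5);
        now = r.time;
        printf("%8.4f s  %s unit %ld (4 KB)\n",
               r.time, r.is_write ? "write" : "read ", r.unit);
    }
    return 0;
}
```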
6. Fault-free performance

Figures 6-1 and 6-2 show the average response time experienced by read and write requests in a fault-free disk array as a function of the declustering ratio. Our simulated system has 21 disks, so the fraction of space consumed by parity units, 1/G, is 1/(20α+1). In the 100% read case we show three average response time curves corresponding to user access rates (λ) of 105, 210, and 378 accesses per second; these correspond to 5, 10, and 18 user reads of 4 KB per second per disk, applied to disks capable of a maximum of about 46 random 4 KB reads per second. These user accesses expand into even more physical accesses due to four-cycle writes and on-the-fly reconstruction. Five accesses per second per disk was chosen as a low load, ten was chosen as a value close to the maximum supportable workload (50% writes) in reconstruction mode, and eighteen was determined to be the maximum supportable user per-disk access rate (50% writes) in the fault-free mode.

Figure 6-1: Response time: 100% reads (average response time in ms versus declustering ratio α, fault-free and degraded, for λ = 105, 210, and 378 accesses per second).

Figure 6-2: Response time: 100% writes (average response time in ms versus declustering ratio α, fault-free and degraded, for λ = 105 and 210 accesses per second).

User writes are much slower than user reads because writes must update parity units as well as data units. Our striping driver's fault-free behavior is to execute four separate disk accesses for each user write instead of the single disk access needed by a user read. Although using four separate disk accesses for every user write is pessimistic [Patterson88, Menon92, Rosenblum91], it gives our results wider generality than would result from simulating specifically optimized systems. Because of this high cost for user writes, our system is not able to sustain 378 user writes of 4 KB per second (this would be 72 4 KB accesses per second per disk); in the 100% write case we therefore show two curves, with much longer average response times, corresponding to λ = 105 and 210.

Figures 6-1 and 6-2 show that fault-free performance is essentially independent of parity declustering, except for writes with α = 0.1. This exception is the result of an optimization that the RAID striping driver employs when requests for one data unit are applied to parity stripes containing only three stripe units [Chen90a]. In this case the driver can choose to write the specified data unit, read the other data unit in this parity stripe, and then directly write the corresponding parity unit, in three disk accesses, instead of pre-reading and overwriting both the specified data unit and the corresponding parity unit in four disk accesses. Declustered parity has the advantage of exploiting our large-write optimization with smaller user writes because it has smaller parity stripes. In the rest of this paper we will neglect the case of α = 0.1 (G = 3) to avoid repeating this optimization discussion.

We note that our parity declustering implementation may not perform equivalently to a left-symmetric RAID5 mapping when user requests are larger than one data unit. Because our implementation does not currently meet our maximal parallelism criterion, it will not be as able to exploit the full parallelism of the array on large user accesses.

7. Degraded-mode performance

Figures 6-1 and 6-2 also show average response time curves for our array when it is degraded by a failed disk that has not yet been replaced. In this case, an on-the-fly reconstruction takes place on every access that requires data from the failed disk. When such an access is a read, the failed disk's contents are reconstructed by reading and computing an exclusive-or over all surviving units of the requested data's parity stripe. Because the amount of work this entails depends on the size of each parity stripe, G, these degraded-mode average response time curves suffer less degradation with a lower parity declustering ratio (smaller α) when the array size is fixed (C = 21). This is the only effect shown in Figure 6-1 (100% reads), but for writes there is another consideration. When a user write specifies data on a surviving disk whose associated parity unit is on the failed disk, there is no value in trying to update this lost parity. In this case a user write induces only one, rather than four, disk accesses. This effect decreases the workload of the array relative to the fault-free case, and, at low declustering ratios, it may lead to slightly better average response time in the degraded rather than the fault-free mode! Overall performance for workloads not modeled by our simulations will be dictated by the balancing of these two effects, and will depend on the access size distribution.
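The on-the-fly reconstruction performed by a degraded-mode read is just an exclusive-or across the surviving units of the parity stripe. A minimal sketch follows; it is our illustration (fixed 4 KB units, our names), not the striping driver's actual code.

```c
#include <stddef.h>
#include <string.h>

#define UNIT_BYTES 4096

/* Rebuild the unit that lived on the failed disk from the G-1 surviving
 * units of its parity stripe (the remaining data units plus the parity
 * unit), by accumulating their exclusive-or.                              */
static void reconstruct_unit(unsigned char *out,
                             unsigned char survivors[][UNIT_BYTES],
                             int nsurvivors)
{
    memset(out, 0, UNIT_BYTES);
    for (int i = 0; i < nsurvivors; i++)
        for (size_t b = 0; b < UNIT_BYTES; b++)
            out[b] ^= survivors[i][b];
}
```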
8. Reconstruction performance

The primary purpose for employing parity declustering over RAID5 organizations has been given as the desire to support higher user performance during recovery and shorter recovery periods [Muntz90]. In this section we show this to be effective, although we also show that previously proposed optimizations do not always improve reconstruction performance. We also demonstrate that there remains an important trade-off between higher performance during recovery and shorter recovery periods.

The time needed to entirely repair a failed disk is equal to the time needed to replace it in the array plus the time needed to reconstruct its entire contents and store them on the replacement. This latter time is termed the reconstruction time. In an array that maintains a pool of on-line spare disks, the replacement time can be kept sufficiently small that repair time is essentially reconstruction time. Highly available disk arrays require short repair times to assure high data reliability, because the mean time until data loss is inversely proportional to mean repair time [Patterson88, Gibson93]. Minimal reconstruction time occurs when user access is denied during reconstruction, but because this cannot take less than the three minutes it takes to read all sectors on our disks, and usually takes much longer, continuous-operation systems require data availability during reconstruction.

In what follows, we distinguish between user accesses, which are those generated by applications as part of the normal workload, and reconstruction accesses, which are those generated by a background reconstruction process to regenerate lost data and store it on a replacement disk. Reconstruction, in its simplest form, involves a single sweep through the contents of a failed disk. For each stripe unit on a replacement disk, the reconstruction process reads all other stripe units in the corresponding parity stripe and computes an exclusive-or on these units. The resulting unit is then written to the replacement disk.

Muntz and Lui identify two optimizations to this simple sweep reconstruction: redirection of reads and piggybacking of writes. In the first, user accesses to data that has already been reconstructed are serviced by (redirected to) the replacement disk, rather than invoking on-the-fly reconstruction as they would if the data were not yet available. This reduces the number of disk accesses invoked by a typical request during reconstruction. In the second optimization, user reads that cause on-the-fly reconstruction also cause the reconstructed data to be written to the replacement disk, since those units need not be subsequently reconstructed; this is targeted at speeding reconstruction. Muntz and Lui also mention that in servicing a user's write to a data unit whose contents have not yet been reconstructed, the device driver has a choice of writing the new data directly to the replacement disk or updating only the parity unit(s) to reflect the modification; in the latter case, the data unit(s) remains invalid until the reconstruction process reaches it. Their model assumes that a user write targeted at a disk under reconstruction is always sent to the replacement, but for reasons explained below, we question whether this is always a good idea.

Therefore, we investigate four algorithms, distinguished by the amount and type of non-reconstruction workload sent to the replacement disk. In our minimal-update algorithm, no extra work is sent to the replacement disk: whenever possible, user writes are folded into the parity unit. In our user-writes algorithm (which corresponds to Muntz and Lui's baseline case), all user writes explicitly targeted at the replacement disk are sent directly to the replacement, and neither reconstruction optimization is enabled. The redirection algorithm adds the redirection of reads optimization to the user-writes case, and the redirect-plus-piggyback algorithm adds the piggybacking of writes optimization to the redirection algorithm.
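The single-sweep reconstruction loop can be sketched as below. This is our own pseudo-driver code under stated assumptions: disk_read(), disk_write(), stripe_of(), surviving_units(), and already_reconstructed() are hypothetical helpers standing in for the striping driver's internals, and the sweep shown only skips units that user writes have already made valid on the replacement; the redirection and piggybacking policies live in the user read/write paths, not in this loop.

```c
#define UNIT_BYTES 4096
#define MAX_G      16

/* Hypothetical helpers (not the paper's interfaces). */
extern int  stripe_of(int disk, int offset);               /* parity stripe id   */
extern int  surviving_units(int stripe, int failed_disk,
                            int (*loc)[2]);                /* (disk,offset) list */
extern void disk_read(int disk, int offset, unsigned char *buf);
extern void disk_write(int disk, int offset, const unsigned char *buf);
extern int  already_reconstructed(int offset);             /* user write landed  */

void reconstruct_failed_disk(int failed, int replacement, int units_per_disk)
{
    for (int off = 0; off < units_per_disk; off++) {
        if (already_reconstructed(off))
            continue;                                       /* nothing to redo    */

        int loc[MAX_G][2];
        int n = surviving_units(stripe_of(failed, off), failed, loc);

        unsigned char acc[UNIT_BYTES] = {0}, buf[UNIT_BYTES];
        for (int i = 0; i < n; i++) {                       /* read phase         */
            disk_read(loc[i][0], loc[i][1], buf);
            for (int b = 0; b < UNIT_BYTES; b++)
                acc[b] ^= buf[b];
        }
        disk_write(replacement, off, acc);                  /* write phase        */
    }
}
```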
8.1. Single thread vs. parallel reconstruction

Figures 8-1 and 8-2 present the reconstruction time and average user response time for our four reconstruction algorithms under a user workload that is 50% 4 KB random reads and 50% 4 KB random writes. These figures show the substantial effectiveness of parity declustering for lowering both the reconstruction time and the average user response time relative to a RAID5 organization (α = 1.0). For example, at 105 user accesses per second, an array with declustering ratio 0.15 reconstructs a failed disk about twice as fast as the RAID5 array while delivering an average user response time that is about 33% lower.

Figure 8-1: Single-thread reconstruction time (seconds) versus α: 50% reads, 50% writes, λ = 105 and 210, for the minimal-update, user-writes, redirect, and redirect+piggyback algorithms.

Figure 8-2: Single-thread average user response time (ms) versus α: 50% reads, 50% writes, λ = 105 and 210, for the same four algorithms.

While Figures 8-1 and 8-2 make a strong case for the advantages of declustering parity, even the fastest reconstruction shown, about 60 minutes, is 20 times longer than the physical minimum time these disks take to read or write their entire contents. (For comparison, RAID products available today specify on-line reconstruction time to be in the range of one to four hours [Meador89, Rudeseal92].) The problem here is that a single reconstruction process, which uses blocking I/O operations, is not able to highly utilize the array's disks. To further reduce reconstruction time we introduce parallel reconstruction processes. This should speed reconstruction substantially, although it will also further degrade the response time of concurrent user accesses. Each reconstruction process acquires the identifier of the next stripe to be reconstructed off of a synchronized list, reconstructs that stripe, and then repeats the process.

Figures 8-3 and 8-4 show the reconstruction time and average user response time when eight processes are concurrently reconstructing a single replacement disk. For workloads of 105 and 210 user accesses per second, reconstruction time is reduced by a factor of four to six relative to Figure 8-1, giving reconstruction times between 10 and 40 minutes, while response times suffer increases of 35% to 75%. Still, even the worst of these average response times is less than 200 ms, so a simple transaction such as the one used in the TPC-A benchmark, which should require less than three disk accesses per transaction, should have a good chance of meeting its required transaction response time of two seconds.

Figure 8-3: Eight-way parallel reconstruction time (seconds) versus α: 50% reads, 50% writes, λ = 105 and 210, for all four algorithms.

Figure 8-4: Eight-way parallel average user response time (ms) versus α: 50% reads, 50% writes, λ = 105 and 210, for all four algorithms.
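Parallel reconstruction runs several such sweeps concurrently, each pulling the next stripe to rebuild off a shared, synchronized list. The pthreads sketch below shows that coordination only; it is our own illustration, and reconstruct_stripe() stands in for one read-phase/write-phase reconstruction cycle.

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int next_stripe = 0, total_stripes = 0;

extern void reconstruct_stripe(int stripe);   /* one reconstruction cycle */

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int s = (next_stripe < total_stripes) ? next_stripe++ : -1;
        pthread_mutex_unlock(&lock);
        if (s < 0) return NULL;
        reconstruct_stripe(s);   /* blocking I/O keeps each worker busy   */
    }
}

void parallel_reconstruct(int nstripes, int nworkers)
{
    pthread_t tid[16];
    if (nworkers > 16) nworkers = 16;
    total_stripes = nstripes;
    next_stripe   = 0;
    for (int i = 0; i < nworkers; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < nworkers; i++)
        pthread_join(tid[i], NULL);
}
```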
8.2. Comparing reconstruction algorithms

The response time curves of Figures 8-2 and 8-4 show that Muntz and Lui's redirection of reads optimization has little benefit in lightly-loaded arrays with a low declustering ratio, but can benefit heavily-loaded RAID5 arrays with 10% to 15% reductions in response time. The addition of piggybacking of writes to the redirection of reads algorithm is intended to reduce reconstruction time without penalty to average user response time. With the user loads we employ, however, piggybacking of writes yields performance that differs very little from redirection of reads alone; we will not pursue piggybacking of writes further in this section.

Figures 8-1 and 8-2 also show that the two "more optimized" reconstruction algorithms do not consistently decrease reconstruction time relative to the simpler algorithms. These are surprising results. We had expected the more optimized algorithms to experience improved reconstruction time due to the off-loading of work from the over-utilized surviving disks to the under-utilized replacement disk. In particular, under single-threaded reconstruction, the user-writes algorithm yields faster reconstruction times than all others for all values of α less than 0.5; similarly, in the eight-way parallel case, the minimal-update and user-writes algorithms yield faster reconstruction times than the other two for all values of α less than 0.5. The reason for this reversal is that loading the replacement disk with random work penalizes the reconstruction writes to this disk more than off-loading benefits the surviving disks, unless the surviving disks are highly utilized. A more complete explanation follows.

For clarity, consider single-threaded reconstruction. Because reconstruction time is the time taken to reconstruct all parity stripes associated with the failed disk's data, our unexpected effect can be understood by examining the number of, and time taken by, reconstructions of a single unit, as shown in Figure 8-5. We call this single stripe unit reconstruction period a reconstruction cycle; it is composed of a read phase and a write phase. The length of the read phase, the time needed to collect and exclusive-or the surviving stripe units, is approximately the maximum of G-1 reads of disks that are also supporting a moderate load of random user requests. Since these disks are not saturated, off-loading a small amount of work from them will not significantly reduce this maximum access time. In contrast, even a small amount of random load imposed on the replacement disk may greatly increase its average access times, because reconstruction writes are sequential and do not require long seeks. This effect must be contrasted with the reduction in the number of reconstruction cycles that occurs when user activity causes writes to the portion of the replacement disk's data that has not yet been reconstructed (and also, in the piggybacking case, when user reads cause reconstructed data to be written to the replacement): they reduce the number of units that need to get reconstructed, suggesting a preference for algorithms that minimize non-reconstruction activity in the replacement disk.

Figure 8-5: The effect of redirection and piggybacking on the reconstruction cycle (read surviving disks, then write replacement, for the simple and the "optimized" algorithms).

Figure 8-6 presents a sample of the durations of the intervals illustrated in Figure 8-5. These numbers are averaged over the reconstruction of the last 300 stripe units on a replacement disk. We selected this operating point to give the redirection of reads algorithm its maximum advantage: for these final reconstruction cycles, the piggybacking of writes is not likely to occur, but the redirection of reads will be at its greatest utility. The figure shows that in the eight-way parallel case, the more complex algorithms tend to yield lower read phase times and higher write phase times. These numbers suggest that with a low declustering ratio the minimal-update algorithm has an advantage over the user-writes algorithm, and both are faster than the other two algorithms. The single-thread case of Figure 8-6 and Figures 8-1 and 8-3 do not entirely bear out this suggestion, because the minimal-update algorithm gets none of its reconstruction work done for it by user requests. In the single-threaded case, reconstruction is so slow that the loss of this "free reconstruction" benefit dominates, and the minimal-update algorithm reconstructs more slowly than the others for all but the smallest declustering ratio. However, in the eight-way parallel case, reconstruction is fast enough that free reconstruction does not compensate for the longer service times experienced by the replacement disk, because free reconstruction moves the heads around randomly.

Figure 8-6: Reconstruction cycle time (ms) at 210 user accesses per second: read and write phase durations for (a) single-thread and (b) eight-way parallel reconstruction, at α = 0.15, 0.45, and 1.0 (MU = minimal-update, UW = user-writes, R = redirect, RP = redirect+piggyback).

8.3. Comparison to analytic model

Muntz and Lui also proposed an analytic expression for reconstruction time in an array employing declustered parity [Muntz90]. Figure 8-7 shows our best attempt to reconcile their model with our simulations. Because their model takes as input the fault-free arrival rate and read-fraction of accesses to the disk, rather than of user requests, we apply the following conversions. Letting the fraction of user accesses that are reads be R, Muntz and Lui's rate of arrival of disk accesses is (4-3R) times larger than our user request arrival rate, required because each user write induces two disk reads and two disk writes, and the fraction of their disk accesses that are reads is (2-R)/(4-3R).

Figure 8-7: Comparing reconstruction time model to simulation at 210 user accesses/second (reconstruction time in seconds versus α for the M&L model and our simulation, for the user-writes, redirect, and redirect+piggyback algorithms, plus our minimal-update simulation).
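The conversion above follows from counting disk accesses: a user read costs one disk read, while a four-cycle user write costs two reads and two writes, so each user request costs R + 4(1-R) = 4-3R accesses, of which R + 2(1-R) = 2-R are reads. The small helper below (ours) makes that arithmetic explicit.

```c
#include <stdio.h>

/* Convert a user request rate and user read fraction R into the disk access
 * rate and disk-level read fraction assumed by the Muntz and Lui model.    */
static void convert(double user_rate, double R)
{
    double disk_rate      = user_rate * (4.0 - 3.0 * R);   /* accesses/second */
    double disk_read_frac = (2.0 - R) / (4.0 - 3.0 * R);   /* fraction reads  */
    printf("user %.0f/s (R=%.2f) -> disk %.0f/s, read fraction %.2f\n",
           user_rate, R, disk_rate, disk_read_frac);
}

int main(void)
{
    convert(210.0, 0.5);   /* the 50%-read workload of Section 8.3 */
    return 0;
}
```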
For greater control of the reconstruction process we intend to implement throttling of reconstruction and/or user workload as well as a ﬂexible prioritization scheme that reduces user response time degradation without starving reconstruction. 74-85. We also ﬁnd that in practice. we intend to explore disk arrays with different stripe unit sizes and user workload characteristics. Previous work proposing declustered parity mappings for redundant disk arrays has suggested. using block designs to map parity stripes onto a disk array insures that both the parity update load and the on-line reconstruction load is balanced over all (surviving) disks in the array. pp. Roland Schaefer. with these disks and random accesses the minimum time required to read or write an entire disk is over 1700 seconds. With an eye to our performance modeling. and John Wilkes for their insightful reviews of early drafts of the paper. 1990. 331338. Jason Lee. Bob Barker. and we would never have met our deadlines without them. We have exploited the special characteristics of balanced incomplete and complete block designs to provide array conﬁguration ﬂexibility while meeting most of our proposed criteria for the “goodness” of a parity allocation. we would like to see an analytical model of a disk array.. Karl Brace.” Proceedings of the 14th Conference on Very Large Data Bases. Terri Watson. Bitton and J. 1988. we hope to install our software implementation of parity declustering on an experimental. “Disk Shadowing. The other manifestation of the problem of using a single disk service rate is their predictions for the beneﬁts of the redirection of reads and piggybacking of writes optimizations. Dan Siewiorek. [Chen90a] P. Conclusions and future work In this paper we have demonstrated that parity declustering. Anurag Gupta. Augustine Kuo. One important concern that neither our nor Muntz and Lui’s work considers is the impact of CPU overhead and architectural bottlenecks in the reconstructing system [Chervenak91]. parallel reconstruction processes are necessary to attain fast recon- Acknowledgments We’re indebted to Ed Lee and the Berkeley RAID project for supplying us with the raidSim simulator. Our disagreements largely result because of our more accurate models of magnetic-disk accesses. and Terri Watson graciously supplied compute cycles on their workstations when time was short. It is this parameter that causes most of the disagreement between their model and our simulations.at which each disk executes its accesses. without implementation. In Figure 8-7 we use the maximum rate at which one disk services entirely random accesses (46 accesses per second) in their model. In particular. Finally. Because redirecting work to the replacement disk in their model does not increase this disk’s average access time. Gray. their predictions for the user-writes algorithm are more pessimistic than for their other algorithms. Dan Stodolsky. Reconstruction time could be further improved by making the parity unit size larger than the data unit size. “An Evaluation of Redundant Arrays of Disks using an Amdahl 5890. This amounts to striping the data on a ﬁner grain than the parity.” Proceedings of the ACM SIGMETRICS Conference. We ﬁnd that their estimates for reconstruction time are signiﬁcantly pessimistic and that their optimizations actually slow reconstruction in some cases. This is consistent with the way Muntz and Lui’s designed their model. To broaden our results. 
The other manifestation of the problem of using a single disk service rate is their predictions for the benefits of the redirection of reads and piggybacking of writes optimizations. Because redirecting work to the replacement disk in their model does not increase this disk's average access time, their predictions for the user-writes algorithm are more pessimistic than for their other algorithms.

9. Conclusions and future work

In this paper we have demonstrated that parity declustering, a strategy for allocating parity in a single-failure-correcting redundant disk array that trades increased parity overhead for reduced user-performance degradation during on-line failure recovery, can be effectively implemented in array-controlling software. We have exploited the special characteristics of balanced incomplete and complete block designs to provide array configuration flexibility while meeting most of our proposed criteria for the "goodness" of a parity allocation. In particular, using block designs to map parity stripes onto a disk array ensures that both the parity update load and the on-line reconstruction load are balanced over all (surviving) disks in the array, although this additional load can degrade user response time substantially. This is achieved without performance penalty in a normal (non-recovering) user workload dominated by small (4 KB) disk accesses.

Previous work proposing declustered parity mappings for redundant disk arrays has suggested the use of block designs, has proposed, without implementation, two optimizations to a simple sweep reconstruction algorithm, and has analytically modeled the resultant expected reconstruction time [Muntz90]. In addition to extending their proposal to an implementation, our work evaluates their optimizations and reconstruction time models in the context of our software implementation running on a disk-accurate simulator. Our findings are strongly in agreement with theirs about the overall utility of parity declustering, but we disagree with their projections for expected reconstruction time and the value of their optimized reconstruction algorithms. We find that their estimates for reconstruction time are significantly pessimistic and that their optimizations actually slow reconstruction in some cases. Our disagreements largely result from our more accurate models of magnetic-disk accesses. Our most surprising result is that in an array with a low declustering ratio and parallel reconstruction processes, the simplest reconstruction algorithm produces the fastest reconstruction time because it best avoids disk seeks. We also find that, in practice, multiple, parallel reconstruction processes are necessary to attain fast reconstruction.

We feel that parity declustering and related performance/reliability tradeoffs are an area rich in research problems, a sample of which we give here. For greater control of the reconstruction process, we intend to implement throttling of reconstruction and/or user workload, as well as a flexible prioritization scheme that reduces user response time degradation without starving reconstruction. To broaden our results, we intend to explore disk arrays with different stripe unit sizes and user workload characteristics. We also have not spent much time on the data mapping in our declustered parity arrays; for example, we think many of our layouts may be amenable to both our large write optimization and maximal parallelism criteria. Reconstruction time could be further improved by making the parity unit size larger than the data unit size; this amounts to striping the data on a finer grain than the parity, and would allow reconstruction to proceed in larger units. With an eye to our performance modeling, we would like to see an analytical model of a disk array, or even of a single drive, that incorporates more of the complexity of disk accesses. We view analytical modeling as very useful for generating performance predictions in short periods of time, but this paper has demonstrated that care must be taken in their formulation. One important concern that neither our work nor Muntz and Lui's considers is the impact of CPU overhead and architectural bottlenecks in the reconstructing system [Chervenak91]. Finally, we hope to install our software implementation of parity declustering on an experimental, high-performance redundant disk array and measure its performance directly.

Acknowledgments

We're indebted to Ed Lee and the Berkeley RAID project for supplying us with the raidSim simulator, and to Peter Chen, Art Rudeseal, Dan Siewiorek, Dan Stodolsky, Terri Watson, and John Wilkes for their insightful reviews of early drafts of the paper. Bob Barker, Karl Brace, Derek Feltham, Anurag Gupta, John Hagerman, Augustine Kuo, Jason Lee, Kevin Lucas, Tom Marchok, Lily Mummert, Howard Read, Roland Schaefer, and Terri Watson graciously supplied compute cycles on their workstations when time was short, and we would never have met our deadlines without them.
References

[Bitton88] D. Bitton and J. Gray, "Disk Shadowing," Proceedings of the 14th Conference on Very Large Data Bases, 1988, pp. 331-338.
[Chen90a] P. Chen, et al., "An Evaluation of Redundant Arrays of Disks using an Amdahl 5890," Proceedings of the ACM SIGMETRICS Conference, 1990, pp. 74-85.
[Chen90b] P. Chen and D. Patterson, "Maximizing Performance in a Striped Disk Array," Proceedings of the ACM SIGARCH Conference, 1990, pp. 322-331.
[Chervenak91] A. Chervenak and R. Katz, "Performance of a Disk Array Prototype," Proceedings of the ACM SIGMETRICS Conference, 1991, pp. 188-197.
[Copeland89] G. Copeland and T. Keller, "A Comparison of High-Availability Media Recovery Techniques," Proceedings of the ACM SIGMOD Conference, 1989, pp. 98-109.
[Dibble90] P. Dibble, "A Parallel Interleaved File System," University of Rochester, Technical Report 334, 1990.
[Geist87] R. Geist and S. Daniel, "A Continuum of Disk Scheduling Algorithms," ACM Transactions on Computer Systems, 5(1), 1987, pp. 77-92.
[Gibson91] G. Gibson, "Redundant Disk Arrays: Reliable, Parallel Secondary Storage," Ph.D. Dissertation, University of California, Technical Report UCB/CSD 91/613, 1991. Also to be published by MIT Press.
[Gibson93] G. Gibson and D. Patterson, "Designing Disk Arrays for High Data Reliability," Journal of Parallel and Distributed Computing, to appear in January 1993.
[Gray90] J. Gray, B. Horst, and M. Walker, "Parity Striping of Disc Arrays: Low-Cost Reliable Storage with Acceptable Throughput," Proceedings of the 16th Conference on Very Large Data Bases, 1990, pp. 148-160.
[Hall86] M. Hall, Combinatorial Theory (2nd Edition), Wiley-Interscience, 1986.
[Holland92] M. Holland and G. Gibson, "Parity Declustering for Continuous Operation in Redundant Disk Arrays," Carnegie Mellon University School of Computer Science Technical Report CMU-CS-92-130, 1992.
[Hsiao90] H. Hsiao and D. DeWitt, "Chained Declustering: A New Availability Strategy for Multiprocessor Database Machines," Proceedings of the 6th International Data Engineering Conference, 1990.
[IBM0661] IBM Corporation, IBM 0661 Disk Drive Product Description, Model 370, Low End Storage Products, 504/114-2, 1989.
[Katz89] R. Katz, et al., "A Project on High Performance I/O Subsystems," ACM Computer Architecture News, 17(5), 1989, pp. 24-31.
[Kim86] M. Kim, "Synchronized Disk Interleaving," IEEE Transactions on Computers, 35(11), 1986, pp. 978-988.
[Lee90] E. Lee, "Software and Performance Issues in the Implementation of a RAID Prototype," University of California, Technical Report UCB/CSD 90/573, 1990.
[Lee91] E. Lee and R. Katz, "Performance Consequences of Parity Placement in Disk Arrays," Proceedings of ASPLOS-IV, 1991, pp. 190-199.
[Livny87] M. Livny, S. Khoshafian, and H. Boral, "Multi-Disk Management Algorithms," Proceedings of the ACM SIGMETRICS Conference, 1987, pp. 69-77.
[Meador89] W. Meador, "Disk Array Systems," Proceedings of COMPCON, 1989, pp. 143-146.
[Menon92] J. Menon and J. Kasson, "Methods for Improved Update Performance of Disk Arrays," Proceedings of the Hawaii International Conference on System Sciences, 1992, pp. 74-83.
[Muntz90] R. Muntz and J. Lui, "Performance Analysis of Disk Arrays Under Failure," Proceedings of the 16th Conference on Very Large Data Bases, 1990, pp. 162-173.
[Ousterhout88] J. Ousterhout, et al., "The Sprite Network Operating System," IEEE Computer, February 1988, pp. 23-36.
[Patterson88] D. Patterson, G. Gibson, and R. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)," Proceedings of the ACM SIGMOD Conference, 1988, pp. 109-116.
[Ramakrishnan92] K. Ramakrishnan, P. Biswas, and R. Karedla, "Analysis of File I/O Traces in Commercial Computing Environments," Proceedings of the ACM SIGMETRICS Conference, 1992, pp. 78-90.
[Reddy91] A. Reddy and P. Banerjee, "Gracefully Degradable Disk Arrays," Proceedings of FTCS-21, 1991, pp. 401-408.
[Rosenblum91] M. Rosenblum and J. Ousterhout, "The Design and Implementation of a Log-Structured File System," Proceedings of the 13th ACM Symposium on Operating System Principles, 1991, pp. 1-15.
[Rudeseal92] A. Rudeseal, private communication, 1992.
[Salem86] K. Salem and H. Garcia-Molina, "Disk Striping," Proceedings of the 2nd IEEE Conference on Data Engineering, 1986.
[Schulze89] M. Schulze, G. Gibson, R. Katz, and D. Patterson, "How Reliable is a RAID?" Proceedings of COMPCON, 1989, pp. 118-123.
[TPCA89] The TPC-A Benchmark: A Standard Specification, Transaction Processing Performance Council, 1989.