2) migrate the task to another submachine that is fault-free,
3) redistribute the task programs and data among the fault-free PEs in A and complete the task using a modified algorithm that does not use the faulty PE.

These recovery options are discussed in detail in [8, 10].

3. CHOOSING AN OPTION
3.1. Overview
The time to reconfigure and complete a task for each reconfiguration option listed in Section 2 can be separated into three primary components: time to plan for the reconfiguration option ($T_{Plan}$), time to move the task data and code ($T_{Trnsfr}$), and time to complete the task execution ($T_{CmpExec}$). In this section, the relative impact these three components have on the overall reconfiguration cost is discussed to derive guidelines for choosing the best option. Experimentally determined ranges for these parameters on the PASM prototype and the nCUBE 2 are used in the analysis where applicable. Table I summarizes the most important notation used in the sections that follow.

Table I: Summary of notation used.
[Table body not recoverable from the source. Footnote: the option superscript stands for the FFS, TM, or TR reconfiguration option.]

In practical situations, the fault-free subdivision and task migration options may not be available. In some systems there may be a minimum size for a submachine. If the current submachine is of minimum size, it cannot be subdivided. It is also possible that no idle destination submachine exists to which to migrate a task. The task can be migrated to a submachine already being used to execute another task, and the two tasks can time-share the submachine. This is discussed in [10].

3.2. Range of $T_{Plan}$ and $T_{Trnsfr}$
The results of an analysis of the range of values that $T_{Plan}$ and $T_{Trnsfr}$ can assume on PASM and nCUBE 2 for all three reconfiguration options considered are provided in Table II. The analysis details can be found in [10].

Table II: Approximate ranges, in microseconds, for $T_{Plan}$ and $T_{Trnsfr}$ for the FFS, TM, and TR reconfiguration options.
[Table body not recoverable from the source. Surviving notes: the upper bound is determined by the size of the PE memories; additional terms, partly illegible, are added for SIMD, MIMD, or mixed-mode tasks.]

3.3. Range of $T_{CmpExec}$
Tasks can be divided into two categories based on execution time. These are tasks with data-independent execution times and tasks with data-dependent execution times. A task with a data-independent execution time does not depend on input data to make branching decisions. Thus, the number of times any branch in the task program code is taken can be determined by a compiler during program compilation, and a compiler can determine an expected execution time for the task. In contrast, a task with a data-dependent execution time has branch decisions that are based on data that is known only at run time. In this case, it is assumed that an expected execution time for the task can be determined through the use of empirical studies (i.e., information about task execution time on various sets of data), an automatic complexity evaluator such as that presented in [5], or through analysis of the algorithm and data sets.

For all the reconfiguration options discussed, the number of PEs assigned to a task after the reconfiguration is equal to or less than the number of PEs originally assigned to the task. It is assumed that the average time for an inter-PE transfer does not increase when the task is executed on fewer PEs.

Once a task has been migrated from a submachine containing a faulty PE to a fault-free submachine of equal size, the time to complete the task execution will be the same as it would have been on the original fault-free submachine. Let $\eta(2^k)$ be the estimated execution time for a task on a submachine with $2^k$ PEs. It is assumed that the total amount of execution time spent on a task prior to a checkpoint is stored with that checkpoint. If a recovering task is to proceed from a checkpoint and the execution time stored with that checkpoint is $\tau$, the expected amount of time required to complete task execution after migrating the task to another submachine is

$$T^{TM}_{CmpExec} = \eta(2^k) - \tau,$$

assuming that the submachine size remains the
1993 International Conference on Parallel Processing
same and that all the PEs in the submachine are fault-free. The expected range for $T^{TM}_{CmpExec}$ is

$$0 < T^{TM}_{CmpExec} \le \eta(2^k).$$

Now, consider the completion time of a task that completes execution on a subdivision that is half the size of the original submachine. $T^{FFS}_{CmpExec}$ is bounded as follows:

$$T^{FFS}_{CmpExec} \le 2(\eta(2^k) - \tau) = 2\,T^{TM}_{CmpExec}.$$

In addition, it is expected that $T^{FFS}_{CmpExec} > T^{TM}_{CmpExec}$ because the number of processors in a fault-free subdivision is assumed to be half the number that would be available if the task was migrated to another submachine. Although some tasks can execute faster on fewer PEs [4, 9, 12], it is assumed here that the original submachine size was selected for minimum execution time. That is, if a smaller submachine could be used to execute the task in the same or less time, the task would have been mapped to that smaller size submachine initially.

A more accurate remaining execution time estimate can be obtained if $\eta(2^{k-1})$ is known. Then, the estimated task execution time becomes a function of the submachine size and the estimate of the remaining execution time becomes

[equation not recoverable from the source]

Consistent with the assumptions given above, $\eta(2^k) < \eta(2^{k-1}) \le 2\eta(2^k)$. Thus, using either the $\eta(2^{k-1})$ information, if it is known, or the inequalities stated in the previous paragraph, the expected range for $T^{FFS}_{CmpExec}$ is:

$$T^{TM}_{CmpExec} < T^{FFS}_{CmpExec} \le 2\,T^{TM}_{CmpExec}.$$

An execution-time estimate for the task redistribution recovery option is more difficult than for the previous options. Consider a task executing on a submachine of size $2^k$ in MIMD mode. If a PE becomes faulty and its subtasks are distributed equally to the $2^k - 1$ fault-free PEs in the submachine, the remaining execution time is bounded as follows:

$$T^{TR}_{CmpExec} \le \frac{2^k}{2^k - 1}\,(\eta(2^k) - \tau) = \frac{2^k}{2^k - 1}\,T^{TM}_{CmpExec}.$$

Consider the situation where the faulty PE's subtasks cannot be distributed equally among the fault-free PEs. In the worst case, all the faulty PE's subtasks would be assigned to a single PE and the remaining execution time could be twice that of the remaining execution time on a fault-free submachine.

In general, it is expected that $T^{TR}_{CmpExec} \le T^{FFS}_{CmpExec}$ because the fault-free subdivision option can be thought of as a subset of the task redistribution option where the task is redistributed to half the PEs in the original submachine. Furthermore, it is expected that $T^{TR}_{CmpExec} > T^{TM}_{CmpExec}$ based on the earlier assumption that the original submachine size was selected for minimum execution time. Therefore, the range on $T^{TR}_{CmpExec}$ is given by:

$$T^{TM}_{CmpExec} < T^{TR}_{CmpExec} \le T^{FFS}_{CmpExec}.$$

In cases where an equal distribution of the task load among the fault-free PEs is possible, the upper bound of $T^{FFS}_{CmpExec}$ in the above inequality can be replaced by $\min\left(\frac{2^k}{2^k - 1}\,T^{TM}_{CmpExec},\; T^{FFS}_{CmpExec}\right)$. By combining the results of the inequalities for remaining execution time determined in this subsection, the following ordering is established:

$$T^{TM}_{CmpExec} < T^{TR}_{CmpExec} \le T^{FFS}_{CmpExec}.$$

Although the above inequality indicates that task migration is the best reconfiguration option when $T_{CmpExec}$ is the dominant factor, it has already been shown that task migration is not the best option when considering $T_{Plan}$ and/or $T_{Trnsfr}$. Therefore, no clear choice is apparent based on the analysis up to this point.

4. PENALTY FOR WRONG CHOICE
Thus far, a quantitative framework has been developed that attempts to relate various reconfiguration parameters. Some of the parameters can be predicted with good precision on real machines, while other parameters can only be coarsely bounded. The next step is to determine if a heuristic can be found that is based on the information available. In this section, a combination of probabilistic analysis and worst-case analysis is used to develop useful guidelines for choosing among reconfiguration options on real machines in practical situations.

Consider the relative magnitudes of $\eta(2^k)$, $T_{Plan}$, and $T_{Trnsfr}$. In general, for tasks with short execution times, it is better to restart the task when a PE becomes unusable rather than permanently and significantly increasing the execution time by including periodic checkpointing. Therefore, dynamic reconfiguration is generally not considered for tasks unless the estimated execution time for the task, $\eta(2^k)$, is orders of magnitude larger than $T_{Trnsfr}$ and $T_{Plan}$.

One of the most common cumulative distribution functions assumed in reliability models is the exponential distribution, $F(t) = 1 - e^{-\lambda t}$ [14]. $F(t)$ represents the probability that a PE fault will occur between time 0 and time $t$, inclusive. The parameter $\lambda$ describes the rate at which failures occur in time.

The reliability function, $R(t)$, is defined as $R(t) = 1 - F(t) = e^{-\lambda t}$. For a parallel system submachine of size $2^k$ PEs, where all the PEs must be operational for the submachine to be operational, the submachine reliability function, $R_{SM}(t)$, is the product of the individual PE reliability functions:

$$R_{SM}(t) = \prod_{i=1}^{2^k} R(t) = \prod_{i=1}^{2^k} e^{-\lambda t} = e^{-2^k \lambda t}.$$

Thus, the submachine-failure probability distribution function is $F_{SM}(t) = 1 - R_{SM}(t) = 1 - e^{-2^k \lambda t}$.
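
The reliability model above can be sketched numerically. The following is a minimal illustration, not code from the paper: the function names, the failure rate, and the time values are assumptions chosen only for the example. It evaluates the per-PE reliability, the submachine reliability as a product over $2^k$ PEs, the resulting submachine failure distribution, and the conditional probability (examined next) that a failure occurs within the first 90% of the task's estimated execution time, given that one occurs before the task would have completed.

```python
import math


def pe_reliability(t, lam):
    """R(t) = exp(-lam * t): probability that one PE is still fault-free at time t."""
    return math.exp(-lam * t)


def submachine_reliability(t, lam, k):
    """R_SM(t): product of the reliabilities of the 2**k PEs in the submachine."""
    return pe_reliability(t, lam) ** (2 ** k)


def submachine_failure_cdf(t, lam, k):
    """F_SM(t) = 1 - exp(-2**k * lam * t), computed with expm1 for small arguments."""
    return -math.expm1(-(2 ** k) * lam * t)


def early_failure_probability(eta, lam, k, fraction=0.9):
    """P(failure by fraction * eta | failure by eta) under the exponential model."""
    return (submachine_failure_cdf(fraction * eta, lam, k)
            / submachine_failure_cdf(eta, lam, k))


if __name__ == "__main__":
    lam = 1e-6  # hypothetical per-PE failure rate (failures per unit time)
    k = 4       # hypothetical submachine of 2**4 = 16 PEs
    for eta in (1.0, 1e5, 1e9):  # hypothetical estimated execution times
        print(eta, early_failure_probability(eta, lam, k))
```

Using `math.expm1` rather than `1 - math.exp(...)` keeps the ratio accurate when $2^k \lambda t$ is very small, which is the regime where the conditional probability is close to 0.9.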
Consider the conditional probability that a failure occurs at or before time $0.9\,\eta(2^k)$ given that a failure occurs at or before time $\eta(2^k)$:

$$\frac{1 - e^{-0.9 \cdot 2^k \lambda\, \eta(2^k)}}{1 - e^{-2^k \lambda\, \eta(2^k)}}.$$

This probability approaches 0.9 as $\eta(2^k)$ approaches zero from the positive direction, and it monotonically approaches 1 as $\eta(2^k)$ increases. Therefore, when $\eta(2^k)$ is 100 times greater than $T_{Trnsfr}$, there is a 0.9 or greater probability that a failure will occur by time $90\,T_{Trnsfr}$, given that a failure occurs by the time the program has completed. Thus, for this case, there is a high probability that $\tau$, the time the failure occurs, will be less than or equal to $90\,T_{Trnsfr}$, and $T_{CmpExec} = \eta(2^k) - \tau \gg T_{Trnsfr}$. For the case where $\eta(2^k)$ is more than 100 times greater than $T_{Trnsfr}$, there is an even greater probability that $\eta(2^k) - \tau \gg T_{Trnsfr}$.

Here, the penalty of making the wrong choice of a reconfiguration option is examined. The worst-case penalty, $T_{penalty}$, is defined to be the worst-case difference between the expected completion time of a task after choosing a suboptimal reconfiguration option and the expected completion time of a task after choosing the optimal reconfiguration option. For example, if the task redistribution option was chosen, but the task migration option would have resulted in the earliest completion time for the task, the worst-case penalty would be:

[equation not recoverable from the source]

where the maximum and minimum refer to the ranges for the parameters. Here, two cases are considered: 1) the reconfiguration choice was made assuming that the remaining execution time was much greater than the time to transfer the task code and data ($(\eta(2^k) - \tau) \gg T_{Trnsfr}$), and 2) the reconfiguration choice was made assuming that the remaining execution time was much less than the time to transfer the task code and data ($(\eta(2^k) - \tau) \ll T_{Trnsfr}$).

First, consider the case where it was incorrectly assumed that $(\eta(2^k) - \tau) \gg T_{Trnsfr}$. In this case, from the results of Subsection 3.3, the task migration option would have been chosen. If the fault-free subdivision option is the optimal one, the worst-case penalty would be:

$$T_{penalty} \le \max(T^{TM}_{Plan} + T^{TM}_{Trnsfr}) - \min(T^{FFS}_{Plan} + T^{FFS}_{Trnsfr}),$$

because the best expected time to complete execution on a fault-free subdivision is greater than the expected time to complete execution after task migration. Furthermore, for machines like PASM and nCUBE 2, $T^{TM}_{Plan} - T^{FFS}_{Plan}$ is negligible, and $T^{TM}_{Trnsfr} - T^{FFS}_{Trnsfr}$ is generally on the order of hundreds of milliseconds (see Table II). Thus, in this case, $T_{penalty}$ is on the order of $T^{TM}_{Trnsfr}$ (recall $T^{FFS}_{Trnsfr} = 0$).

If instead the task redistribution option is the optimal choice for this example, the worst-case penalty would be:

$$T_{penalty} \le \max(T^{TM}_{Plan} + T^{TM}_{Trnsfr}) - \min(T^{TR}_{Plan} + T^{TR}_{Trnsfr}).$$

Again, the best expected time to complete execution after task redistribution is greater than the expected time to complete execution after task migration. For PASM and nCUBE 2, $T_{penalty}$ is on the order of $T^{TM}_{Trnsfr}$ (see Table II).

Now, consider the case where it was incorrectly assumed that $(\eta(2^k) - \tau) \ll T_{Trnsfr}$. In this situation, either the fault-free subdivision or task redistribution option would have been chosen. If the fault-free subdivision option was chosen when the task migration option would have been better (because in actuality $(\eta(2^k) - \tau) \gg T_{Trnsfr}$), the penalty for making the wrong choice is given by:

[equation not recoverable from the source]

Recall the value of $\eta(2^k)$ is assumed to be much larger than $T_{Trnsfr}$ when reconfiguration options are to be considered. Thus, in the worst case, the penalty for incorrectly assuming $(\eta(2^k) - \tau) \ll T_{Trnsfr}$ is much greater than that for incorrectly assuming $(\eta(2^k) - \tau) \gg T_{Trnsfr}$. A similar analysis for the case where task redistribution was erroneously chosen over task migration results in the same potential for a large penalty.

To summarize this section, two conclusions are made: first, it is expected that there is a high probability that $T_{CmpExec}$ will be much greater than $T_{Trnsfr}$ when a fault occurs, and second, that the worst-case penalty for incorrectly assuming this is true is far less than the worst-case penalty for incorrectly assuming the opposite. Therefore, a mathematical justification for choosing a reconfiguration option by considering only the time required to complete the task has been established. Combining this result with the results of Subsection 3.3, the choice of reconfiguration strategy becomes one of choosing to migrate the task if an idle submachine exists. If this option is not available, the next best option is task redistribution. Finally, if the task does not lend itself to redistribution, a fault-free subdivision can be used to complete the task.

The model parameter value ranges established in Section 3 are in some cases very coarse, e.g., $T_{CmpExec}$ for tasks with data-dependent (nondeterministic) execution