
Page Fault Behavior and Prefetching in Software DSMs

Ricardo Bianchini, Raquel Pinto, and Claudio L. Amorim
COPPE Systems Engineering, Federal University of Rio de Janeiro
Rio de Janeiro, Brazil 21945-970
FAX/Phone: +55 21 590-2552
Technical Report ES-401/96, July 1996, COPPE/UFRJ

Abstract
Prefetching strategies can conceivably be used to reduce the high remote data access latencies of software-only distributed shared-memory systems (DSMs). However, in order to design effective prefetching techniques, one must understand the page fault behavior of parallel applications running on top of these systems. In this paper we study this behavior according to its spatial, temporal, and sharing-related characteristics. Among other important observations, our study shows that: a) the amount of useful computation between the earliest time a prefetch can be issued and the actual use of the page is enough to hide most or all of the latency of fetching remote data, which means that prefetching techniques have the potential to be effective; b) page fault patterns change significantly throughout execution for several applications, which means that prefetching techniques based on dynamic, recent-history information may not be effective; c) page faults are frequently spatially clustered but not temporally clustered, which means that sequential prefetching as in hardware DSMs may also be profitable for software DSMs; and d) the set of page invalidations received at synchronization points is often a poor indication of near-future page accesses, which means that the invalidations should not be used to guide prefetching. Based on this study we propose and evaluate five prefetching techniques for the TreadMarks system running on our simulated network of workstations. Our results show that the prefetching techniques we study can deliver performance improvements of up to 30%, but no one technique is consistently effective for all applications.

1 Introduction
Software-only distributed shared-memory systems (DSMs) combine the ease of shared-memory programming with the low cost of message-passing architectures. However, these systems often exhibit high remote data access latencies when running real parallel applications. Prefetching strategies can conceivably be used to reduce these latencies.
This research was supported by Brazilian FINEP and CNPq.

In order to design effective prefetching techniques, one must understand the page fault behavior of parallel applications running on top of software DSMs. Such an understanding should provide insight into whether prefetching in general, and specific prefetching strategies in particular, can be effective. Thus, this paper presents a study of the page fault behavior of applications running on software DSMs and uses its results to guide the design of several prefetching techniques. We concentrate on the spatial, temporal, and sharing-related characteristics of the sequence of page faults that require remote data fetches. Basically, we are interested in answering several important questions about the page fault behavior and its relationship to prefetching: Is there a significant amount of useful computation that can be used to tolerate the latency of fetching remote data? Do we need sophisticated compilers to insert prefetch calls to the runtime system in applications, or do dynamic, runtime-only techniques suffice? Are page faults spatially and/or temporally clustered? Can prefetching techniques use page invalidations at lock and barrier operations to guide prefetching? This paper answers these questions for the first time, as far as we are aware.

Our simulation results of parallel applications running on top of TreadMarks [8] show that the average amount of useful computation between the earliest time a prefetch can be issued and the actual use of the page is enough to hide most or all of the latency of remote data operations for all our applications. Our results also demonstrate that page fault patterns change significantly throughout execution for several applications, which means that prefetching techniques based on dynamic, recent-history information may not be effective; adaptive or compiler-based techniques might be necessary. We find that page faults are frequently spatially clustered but not temporally clustered, which means that sequential prefetching as in hardware DSMs may also be profitable for software DSMs. In addition, our results show that the set of page invalidations received at synchronization points is often a poor indication of near-future page accesses, which means that the invalidations should not be used to guide prefetching.

Based on these results we propose and evaluate five different prefetching techniques. Among other characteristics, the techniques vary in terms of the aggressiveness with which they prefetch, the use of invalidation notices to guide prefetching, and the use of compiler-inserted prefetching calls in applications. We evaluate the techniques when implemented in TreadMarks. Our results show that the prefetching techniques can deliver performance improvements of up to 30%, but no one technique performs consistently well for all applications.

The remainder of this paper is organized as follows. The next section motivates the paper by describing the main characteristics of TreadMarks and showing that remote data fetch overheads significantly degrade the performance of applications running on top of it. Section 3 describes our simulation methodology and workload. In section 4 we discuss our page fault behavior results.

Section 5 describes the prefetching techniques we propose based on the observed page fault behavior. The section also presents results on the performance of these techniques. In section 6 we describe related work. Finally, section 7 draws our conclusions.

2 Overheads in Software DSMs


Several software DSMs use virtual memory protection bits to enforce coherence at the page level. In order to minimize the impact of false sharing, these DSMs seek to enforce memory consistency only at synchronization points, and allow multiple processors to write the same page concurrently [3]. TreadMarks is an example of a system that enforces consistency lazily. In TreadMarks, page invalidation happens at lock acquire points, while the modifications (diffs) to an invalidated page are collected from previous writers at the time of the first access (fault) to the page. The modifications that the faulting processor must collect are determined by dividing the execution into intervals associated with synchronization operations and computing a vector timestamp for each of the intervals. A synchronization operation initiates a new interval. The vector timestamp describes a partial order between the intervals of different processors. Before the acquiring processor can continue execution, the diffs of intervals with smaller vector timestamps than the acquiring processor's current vector timestamp must be collected. The previous lock holder is responsible for comparing the acquiring processor's current vector timestamp with its own vector timestamp and sending back write notices, which indicate that a page has been modified in a particular interval. When a page fault occurs, the faulting processor consults its list of write notices to find out the diffs it needs to bring the page up-to-date. It then requests the corresponding diffs and waits for them to be (generated and) sent back. After receiving all the diffs requested, the faulting processor can then apply them in turn to its outdated copy of the page. A more detailed description of TreadMarks can be found in [8].

The main overheads in software DSMs are related to communication latencies and coherence actions. Communication latencies cause processor stalls that degrade system performance. Coherence actions (e.g. twin generation and diff generation and application) can also negatively affect overall performance, since they accomplish no useful work and are in the critical path of the computation. The impact of communication and coherence overheads is magnified by the fact that remote processors are involved in all of the corresponding transactions.

To demonstrate the extent of the overhead problem, consider figures 1 and 2 (1). Figure 1 presents the speedups achieved by our applications running on top of TreadMarks and shows that all of them perform poorly. Figure 2 explains this result. The figure presents a detailed view of the execution time performance of our applications running on top of standard TreadMarks on 16 processors. The bars in the figure show normalized execution time broken down into busy time, data fetch latency, synchronization time, IPC overhead, and other overheads. The latter category is comprised of TLB miss latency, write buffer stall time, interrupt time, and the most significant of these overheads, cache miss latency.
(1) The details of the simulation and application characteristics that led to these figures will be presented in section 3.
Figure 1: Application Speedups under TreadMarks DSM.

Figure 2: Application Performance under TreadMarks DSM on 16 processors.

The busy time represents the amount of useful work performed by the computation processor. Data fetch latency is a combination of coherence processing time and the network latencies involved in fetching pages and diffs as a result of page access violations. Synchronization time represents the delays involved in waiting at barriers and lock acquires/releases, including the overhead of interval and write notice processing. IPC overhead accounts for the time the computation processor spends servicing requests coming from remote processors.

Figure 2 shows that TreadMarks suffers severe remote data fetch and synchronization overheads. IPC overheads are not as significant, since they are often hidden by data fetch and synchronization latencies. However, IPC overheads gain importance when prefetching is used. This study seeks a complete understanding of each processor's page fault behavior, which can lead to the design of prefetching techniques for significantly reducing the overhead of remote data fetches.
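To make the interval and write-notice mechanism described above concrete, the sketch below shows, in simplified C, what a lazy release-consistent DSM does at a lock acquire: intervals not yet covered by the acquirer's vector timestamp contribute write notices, and the pages they name are invalidated so that the first subsequent access faults and fetches the corresponding diffs. This is our own minimal illustration under assumed data structures (interval_t, proc_t), not TreadMarks source code.

```c
#include <stdbool.h>
#include <stdio.h>

#define NPROCS 4     /* processors (illustrative; the paper simulates 16) */
#define NPAGES 64    /* shared pages (illustrative) */

/* One interval, as described above: a vector timestamp plus the set of
 * pages written during the interval (the write notices it generates). */
typedef struct {
    int  vt[NPROCS];
    int  owner;          /* processor that executed the interval */
    int  npages;
    const int *pages;    /* pages modified during the interval   */
} interval_t;

/* The acquirer's state: its vector timestamp and a per-page valid bit
 * (the virtual-memory protection state, much simplified). */
typedef struct {
    int  vt[NPROCS];
    bool valid[NPAGES];
} proc_t;

/* True if the acquirer has already seen this interval of its owner. */
static bool covered(const proc_t *me, const interval_t *iv)
{
    return me->vt[iv->owner] >= iv->vt[iv->owner];
}

/* At a lock acquire: apply the write notices of every interval the
 * acquirer has not yet seen by invalidating the pages they name and
 * advancing the timestamp.  Diffs are NOT fetched here -- they are
 * fetched lazily at the first fault on each invalidated page. */
static void on_lock_acquire(proc_t *me, const interval_t *ivs, int n)
{
    for (int i = 0; i < n; i++) {
        if (covered(me, &ivs[i]))
            continue;
        for (int p = 0; p < ivs[i].npages; p++)
            me->valid[ivs[i].pages[p]] = false;   /* write notice: invalidate */
        if (me->vt[ivs[i].owner] < ivs[i].vt[ivs[i].owner])
            me->vt[ivs[i].owner] = ivs[i].vt[ivs[i].owner];
    }
}

int main(void)
{
    static const int written[] = { 7, 8 };
    interval_t iv = { .vt = { 0, 2, 0, 0 }, .owner = 1,
                      .npages = 2, .pages = written };
    proc_t me = { .vt = { 1, 1, 0, 0 },
                  .valid = { [7] = true, [8] = true, [9] = true } };
    on_lock_acquire(&me, &iv, 1);
    printf("page 7 valid: %d, page 9 valid: %d, vt[1] = %d\n",
           me.valid[7], me.valid[9], me.vt[1]);   /* prints 0, 1, 2 */
    return 0;
}
```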

3 Methodology and Workload

3.1 Multiprocessor Simulation


Our simulator consists of two parts: a front end, Mint [10], that simulates the execution of the computation processors, and a back end that simulates the memory system (finite-sized write buffers and caches, TLB behavior, full protocol emulation, network transfer costs including contention effects, and memory access costs including contention effects) in great detail. The front end calls the back end on every data reference (instruction fetches are assumed to always be cache hits). The back end decides which computation processors block waiting for memory (or other events) and which continue execution. Since this decision is made on-line, the back end affects the timing of the front end, so that the interleaving of instructions across processors depends on the behavior of the memory system, and control flow within a processor can change as a result of the timing of memory references.

We simulate a network of workstations with 16 nodes in detail. Each node consists of a computation processor, a write buffer, a first-level direct-mapped data cache (all instructions are assumed to take 1 cycle), local memory, and a mesh network router (using wormhole routing). Table 1 summarizes the default parameters used in our simulations. All times are given in 10-ns processor cycles.

System Constant Name                 | Default Value
Number of processors                 | 16
TLB size                             | 128 entries
TLB fill service time                | 100 cycles
All interrupts                       | 400 cycles
Page size                            | 4K bytes
Total cache per processor            | 128K bytes
Write buffer size                    | 4 entries
Cache line size                      | 32 bytes
Memory setup time                    | 10 cycles
Memory access time (after setup)     | 3 cycles/word
PCI setup time                       | 10 cycles
PCI burst access time (after setup)  | 3 cycles/word
Network path width                   | 8 bits (bidirectional)
Messaging overhead                   | 200 cycles
Switch latency                       | 4 cycles
Wire latency                         | 2 cycles
List processing                      | 6 cycles/element
Page twinning                        | 5 cycles/word + memory accesses
Diff application and creation        | 7 cycles/word + memory accesses

Table 1: Default Values for System Parameters. 1 cycle = 10 ns.
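For a rough sense of the magnitudes these parameters imply, the fragment below combines a few Table 1 values into a back-of-the-envelope estimate of an uncontended full-page diff fetch from one remote node. The additive model, the 1-cycle-per-byte transfer rate, and the assumption that the whole page was modified are ours for illustration; this is not the simulator's actual accounting, which also models contention, switch and wire latencies, and memory setup costs.

```c
#include <stdio.h>

int main(void)
{
    /* Parameters taken from Table 1 (1 cycle = 10 ns). */
    const int page_bytes   = 4096;
    const int word_bytes   = 4;
    const int msg_overhead = 200;  /* per message                            */
    const int diff_cpw     = 7;    /* diff creation/application, cycles/word */
    const int net_cpb      = 1;    /* assumed: 8-bit path, ~1 cycle per byte */

    int words = page_bytes / word_bytes;

    /* Worst case: the whole page was modified, so the diff is page-sized. */
    int request  = msg_overhead;                          /* small request   */
    int create   = diff_cpw * words;                      /* remote builds diff */
    int transfer = msg_overhead + net_cpb * page_bytes;   /* diff reply      */
    int apply    = diff_cpw * words;                      /* apply locally   */

    int total = request + create + transfer + apply;
    printf("rough uncontended full-page diff fetch: ~%d cycles (~%d us)\n",
           total, total / 100);   /* 100 cycles per microsecond at 10 ns */
    return 0;
}
```

Even this crude estimate puts a remote fetch at roughly 2 x 10^4 cycles, the same order as the inter-fault gaps reported in section 4.3, which is why hiding the latency is plausible at all.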

3.2 Workload
We report results for four representative parallel programs: Em3d, FFT, Radix, and Ocean. Em3d [4] is from UC Berkeley. FFT is from Rice University and comes with the TreadMarks distribution. Ocean and Radix are from the Splash-2 suite [11]. These applications were run on the default problem sizes for 32 processors, as suggested by the Stanford researchers. Table 2 lists the applications and their input sizes.

Em3d simulates electromagnetic wave propagation through 3D objects. We simulate 40064 electric and magnetic objects connected randomly, with a 10% probability that neighboring objects reside in different nodes. The interactions between objects are simulated for 6 iterations. FFT performs a complex 1-D FFT that is optimized to reduce interprocessor communication. The data set consists of 65,536 data points to be transformed, and another group of 65,536 points called roots of unity. Each of these groups of points is organized as a 256 x 256 matrix. Ocean studies large-scale ocean movements based on eddy and boundary currents. We simulate a 258 x 258 ocean grid. Radix is an integer radix sort kernel. The algorithm is iterative, performing one iteration per digit of the 1M keys.

Application | Input Size
Em3d        | 40064 nodes, 10% remote
FFT         | 65,536 data points
Ocean       | 258 x 258 ocean
Radix       | 1M integers, radix 1024

Table 2: Applications and Input Sizes.

4 Page Fault Behavior


We start this section with an overview of the page fault behavior of applications running on a software DSM, and then turn to a detailed analysis of the spatial, temporal, and sharing-related behavior of the sequence of page faults taken by each processor. We are interested in the aspects of the page fault behavior that relate to prefetching. We use TreadMarks as our sample DSM, but the behavior we observe should apply to any lazy release-consistent, page-based system allowing multiple concurrent writers to a page.

4.1 Overview
Given that realistic parallel applications generate an enormous amount of page fault information, in this section we concentrate on representative snapshots of the execution of our applications. Figure 3 presents three consecutive snapshots of the sequence of page faults taken by each of the applications in our suite. Each of the snapshots corresponds to a phase of execution (the time period in between two consecutive barrier events) on one of the processors (processor 7) in the system. The lower and upper limits on the Y-axis of the graphs for each application represent the first and last pages of its shared data area, respectively. Lock and unlock operations are represented by full and dashed vertical lines, respectively. The graphs in the figure show several important characteristics of the page fault behavior of our applications:

Figure 3: Overview of Page Fault Behavior for Em3d, FFT, Ocean, and Radix (from top to bottom).

- The number of page faults within a phase of execution may be significant, as in FFT and Radix, and may vary widely across phases, as in Radix.
- Critical sections delimited by lock and unlock operations have very different page fault behavior than the rest of the phase. This can be clearly observed in Radix, but it also happens for Ocean; the other two applications do not use lock synchronization.
- Page faults are fairly clustered inside small chunks of the shared address space in most cases.
- Page faults are frequently spread more or less evenly throughout each snapshot, i.e. faults are not usually clustered in time.

Although interesting, these observations are too superficial to be useful when designing and evaluating prefetching strategies. For a more precise analysis of these and other characteristics of the stream of page faults experienced by a processor, we must "zoom in" on the spatial, temporal, and sharing-related characteristics of our applications. In this detailed analysis we study the page faults occurring inside and outside of critical sections separately, given the significant difference in their associated page fault behaviors.

4.2 Distribution of Page Faults in Space


In order to understand the spatial properties of a sequence of page faults occurring outside of critical sections, we must determine whether applications exhibit inter-phase fault patterns or intra-phase spatial clustering of page faults. For page faults inside critical sections we must determine whether the sequence of faults exhibits fixed patterns across different executions of the critical section or any spatial clustering within each section. Spatial clustering is important to prefetching in that a page fault on a certain page is a hint that a nearby page will likely be accessed in the near future. Fixed fault patterns are important in that any past sequence of page faults is a good indication of future sequences of faults. Thus, we must determine whether there exists a pattern of page faults that is frequently repeated across phases or different executions of critical sections, and whether the pages faulted on during a phase or critical section are closely clustered in space.

We investigate the existence of inter-phase patterns by calculating the percentage of the total number of pages that experience faults (outside of critical sections) in any two consecutive or alternating phases of execution. These results are shown in the graphs on the left side of figure 4. The Em3d graph shows that this application exhibits a pattern of faults that gets repeated in alternating phases. Radix exhibits very little similarity between phases, at most 25%. Ocean and FFT fall in between these two extremes. In the case of FFT, the fault pattern of the third phase is repeated in the fifth phase of execution, as shown by the peak at the third pair of phases, but not in other phases. The similarity between the sequences of faults of the different consecutive phases of Ocean varies quite a bit, from as low as 5% to 100%.

We assess the amount of intra-phase clustering by sorting the set of faults that occur outside of critical sections with respect to page numbers and computing the average difference between each page number and the next in the sorted list.
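Both metrics can be computed directly from per-phase traces of faulted page numbers. The sketch below is our own illustration (not the instrumentation used in the simulator): it reports the pages faulted on in both of two phases as a percentage of all pages faulted on in either (one reading of the similarity metric above), and computes the average stride between consecutive page numbers after sorting.

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    return (*(const int *)a > *(const int *)b) -
           (*(const int *)a < *(const int *)b);
}

/* Similarity between two phases: pages faulted on in both, as a percentage
 * of all pages faulted on in either.  Inputs must be sorted, duplicate-free. */
static double phase_similarity(const int *a, int na, const int *b, int nb)
{
    int i = 0, j = 0, both = 0, either = 0;
    while (i < na || j < nb) {
        either++;
        if (j >= nb || (i < na && a[i] < b[j]))  i++;
        else if (i >= na || b[j] < a[i])         j++;
        else { both++; i++; j++; }               /* page appears in both phases */
    }
    return either ? 100.0 * both / either : 0.0;
}

/* Average stride: sort the pages faulted on in one phase and average the
 * difference between consecutive page numbers. */
static double average_stride(int *pages, int n)
{
    if (n < 2) return 0.0;   /* undefined; such phases are left out in figure 4 */
    qsort(pages, n, sizeof pages[0], cmp_int);
    long sum = 0;
    for (int i = 1; i < n; i++)
        sum += pages[i] - pages[i - 1];
    return (double)sum / (n - 1);
}

int main(void)
{
    int phase2[] = { 100, 101, 102, 110, 111 };   /* illustrative fault traces */
    int phase4[] = { 100, 101, 103, 110, 112 };
    printf("similarity: %.1f%%\n", phase_similarity(phase2, 5, phase4, 5));
    printf("avg stride: %.2f\n", average_stride(phase2, 5));
    return 0;
}
```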

Figure 4: Spatial Distribution of Faults for Em3d, FFT, Ocean, and Radix (from top to bottom).

These results are shown in the graphs on the right side of figure 4. The discontinuities in the FFT and Radix graphs reflect the fact that certain phases in these applications have fewer than two page faults, making it impossible to compute strides. These graphs show that FFT exhibits relatively large strides in its first phase, but fetches diffs for a sequential group of pages during the third, fourth, and fifth phases of its execution. Em3d and Radix also exhibit very tight clustering of page faults in some of their phases, but not all of them. Ocean's average fault stride of around 8 persists for most of its phases, but for the other phases strides can be as large as 500.

Page faults that occur inside of critical sections exhibit much simpler behavior. Critical sections tend to be relatively short in the applications we study. In the vast majority of cases, either one or two pages experience faults within the sections. Different executions of a critical section exhibit a clear pattern, as they tend to always access the same pages.

4.3 Distribution of Page Faults in Time


The temporal distribution of page faults has a significant effect on the amount of network and memory contention that can result from all processors having to fetch diffs within a narrow window of time. More importantly for our purposes, however, a significant amount of time between faults indicates that prefetches can be issued on a page fault and complete before the corresponding data is actually required. In addition, the timing of faults has a direct impact on whether prefetching initiated at synchronization operations has the potential to improve performance; there must exist a reasonable amount of computation between a lock acquire or barrier event and the page faults to hide the latency of the corresponding prefetches.

Figure 5 shows some of the data we collected on these two properties of applications. The graphs on the left side of the figure present the average time in processor cycles between page faults (occurring outside of critical sections) during each phase of execution. Again, discontinuities represent phases with one or no page faults. The average amount of time between faults is always significant, at least 20,000 cycles in all cases, but almost always more than 30,000 cycles. This result shows that page faults are indeed not tightly clustered in time on average.

The graphs on the right side of figure 5 present the average amount of time (in processor cycles) between a barrier event and a page fault occurring outside of any critical section. These graphs show that in most cases there is plenty of time for prefetches to complete before pages are actually accessed, if prefetches are started at barrier events. The only exceptions here are some intervals of Ocean and Radix, where page faults happen almost immediately after the barriers.

Page faults that occur inside of critical sections happen almost immediately after the lock acquire operation. There is usually not enough computation to hide the latency of the few prefetches that could be issued.
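The averages plotted in figure 5 are simple means over a phase's fault timestamps; comparing them against an assumed remote fetch cost says whether a prefetch issued at the barrier, or at the previous fault, could complete in time. The sketch below is illustrative only; the 20,000-cycle fetch cost and the trace values are assumptions, not measurements.

```c
#include <stdio.h>

/* Mean gap between consecutive fault times within one phase. */
static double mean_gap(const long *t, int n)
{
    if (n < 2) return 0.0;
    return (double)(t[n - 1] - t[0]) / (n - 1);
}

int main(void)
{
    /* Fault times (cycles) in one phase, and the barrier that started it. */
    long barrier  = 1000000;
    long faults[] = { 1600000, 1632000, 1665000, 1700000 };
    int  n        = 4;

    const long fetch_cost = 20000;   /* assumed remote fetch latency, cycles */

    printf("barrier to first fault: %ld cycles\n", faults[0] - barrier);
    printf("mean time between faults: %.0f cycles\n", mean_gap(faults, n));
    printf("prefetch at barrier hideable: %s\n",
           faults[0] - barrier >= fetch_cost ? "yes" : "no");
    printf("prefetch at previous fault hideable: %s\n",
           mean_gap(faults, n) >= fetch_cost ? "yes" : "no");
    return 0;
}
```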

Figure 5: Temporal Distribution of Faults for Em3d, FFT, Ocean, and Radix (from top to bottom).

4.4 Distribution of Page Faults and Its Relationship to Data Sharing


One important aspect of the page fault behavior of applications running on top of software DSMs is the relative percentage of faults caused by sharing, as opposed to cold-start accesses. Our experiments show that the contribution of sharing faults to the total page fault rate varies significantly across applications. For Ocean and Em3d, the percentage of sharing faults is high (88% and 74%, respectively), while for FFT and Radix it is very low (31% and 8%, respectively).

The set of write notices exchanged in the system is another important aspect of software DSMs. Systems based on Lazy Release Consistency delay the transfer of invalidations until lock acquire or barrier events. Ideally, the set of invalidations received at these synchronization points should correspond to the pages that will effectively be referenced (and therefore must be coherent) during the following critical section or phase of execution. Thus, invalidations might be successfully used to guide prefetching, provided that they indeed represent a good indication of the set of pages to be referenced in the near future. Figures 6 and 7 assess whether this is the case for page faults occurring outside and inside of critical sections, respectively. Figure 7 only includes data for Ocean and Radix, since the other applications do not involve lock synchronization after the first barrier event. In order to plot these graphs we computed the intersection between the set of pages invalidated at synchronization operations and the set of pages that effectively generate page faults during the critical section or phase of execution. The figures plot the size of the intersection as a percentage of the number of pages invalidated for each phase and critical section.

The figures show that the set of invalidations is not consistently similar to the set of pages that will cause faults in the near future. Figure 6 demonstrates that only in the case of Em3d are the two sets consistently similar for page faults occurring outside of critical sections. In the case of Ocean the similarity between the sets goes up and down depending on the specific phase of execution, while only rarely do the sets resemble each other for FFT and Radix. Figure 7 demonstrates that invalidations are a good representation of future page faults for Radix, but not for Ocean, for faults that occur inside of critical sections.

Figure 6: Relationship Between Faults Outside of Critical Sections and Sharing for Em3d, FFT, Radix, and Ocean (clockwise from top left corner).

Figure 7: Relationship Between Faults Inside of Critical Sections and Sharing for Ocean (left) and Radix (right).
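The overlap plotted in figures 6 and 7 can be measured as sketched below: mark the pages named by the write notices received at a synchronization point, then count how many of them actually fault during the following phase or critical section. This is our own minimal illustration of the metric, with an assumed shared-area size.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_PAGES 8192   /* size of the shared area, in pages (illustrative) */

/* Percentage of the pages invalidated at a synchronization point that
 * actually fault during the following phase / critical section. */
static double inval_fault_overlap(const int *invals, int ninv,
                                  const int *faults, int nflt)
{
    static bool faulted[MAX_PAGES];
    for (int i = 0; i < MAX_PAGES; i++) faulted[i] = false;
    for (int i = 0; i < nflt; i++)      faulted[faults[i]] = true;

    int hit = 0;
    for (int i = 0; i < ninv; i++)
        if (faulted[invals[i]]) hit++;
    return ninv ? 100.0 * hit / ninv : 0.0;
}

int main(void)
{
    int invals[] = { 10, 11, 12, 40, 41 };   /* write notices at a barrier */
    int faults[] = { 10, 12, 90 };           /* faults in the next phase   */
    printf("invalidations that predict faults: %.0f%%\n",
           inval_fault_overlap(invals, 5, faults, 3));   /* 40% here */
    return 0;
}
```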

4.5 Discussion
The results we presented in the previous sections describe the page fault behavior of applications running on top of multiple-writer, lazy release-consistent software DSMs. Our understanding of this fault behavior allows us to guide the design of our prefetching techniques based on several observations:

1. The significant number of page faults occurring in certain phases of execution suggests that prefetching for all of the corresponding pages at once might cause excessive resource contention.

2. The widely different fault behavior across phases of the execution of Radix suggests that fixed, runtime-only prefetching techniques may fail to improve performance for some applications. Some form of adaptive, runtime technique might be a reasonable option, but it may require a relatively large number of page faults before adaptation is complete. Compiler analysis should help when fault patterns change or when a large percentage of the misses is not due to sharing, but static techniques are not always applicable or even possible.

3. The differences between the fault behavior of pages faulted on outside and inside of critical sections suggest that different prefetching policies for these two types of faults might be appropriate. The fact that very few pages are accessed inside most critical sections, and that these accesses occur almost immediately after the lock acquire is completed (when it is safe to prefetch), indicates that prefetching for these pages is not profitable. However, this does not necessarily mean that we should not prefetch at lock acquire points, since the diffs prefetched then might be used later, outside of the critical section.

4. The fact that there is always a significant amount of time between a barrier event and each page fault indicates that the latency of the required prefetches can be completely hidden if prefetches are started at the barrier.

5. Prefetching on a page fault is also a viable option, provided that the processor can guess the very next page that will be required with reasonable accuracy. Given the spatial clustering of faults in FFT, Em3d, and Radix, sequential prefetching might provide appropriate guesses.

6. Prefetching based on the write notices received at synchronization points might not be a good strategy in some cases, since the notices often do not provide a good indication of future page accesses.

5 Prefetching Techniques for TreadMarks


In this section we propose and evaluate several prefetching techniques for software DSMs. Subsection 5.1 describes the techniques, referring to the observations made in the previous section as their motivation. In subsection 5.2 we evaluate the techniques according to their prefetch utilization and execution time results.

5.1 Description of the Techniques


We propose five prefetching techniques that can be easily implemented in TreadMarks. The idea behind the techniques is that all operations involved in servicing page faults in standard TreadMarks can be performed in advance of the actual page access.

The simplest technique we propose, Naive, assumes that a page that has been recently accessed by a processor, but later invalidated by another one, will likely be referenced again in the near future. Thus, the Naive technique prefetches diffs for each of the pages invalidated at lock acquire and barrier events, provided that the page is currently valid at the local node. All prefetches are issued during the synchronization operations themselves, as suggested by observation 4 from the previous section.
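A minimal sketch of the Naive policy follows, using our own names and data structures (prefetch_diffs is an assumed runtime hook, not a TreadMarks call): at each lock acquire or barrier, every page named by an incoming write notice that was valid at the local node until that invalidation has its diffs prefetched immediately.

```c
#include <stdbool.h>
#include <stdio.h>

#define NPAGES 64   /* shared pages (illustrative) */

/* Stand-in for the runtime's asynchronous diff-fetch request (assumed hook). */
static void prefetch_diffs(int page) { printf("prefetch diffs for page %d\n", page); }

typedef struct {
    bool valid[NPAGES];   /* page currently valid at this node */
} node_t;

/* Naive: called at a lock acquire or barrier with the pages named by the
 * write notices received there.  A page that was valid here until this
 * invalidation is assumed to be needed again soon, so its diffs are
 * prefetched right away; pages this node never had are left alone. */
static void naive_prefetch(node_t *me, const int *write_notices, int n)
{
    for (int i = 0; i < n; i++) {
        int page    = write_notices[i];
        bool had_it = me->valid[page];
        me->valid[page] = false;       /* apply the invalidation           */
        if (had_it)
            prefetch_diffs(page);      /* issue the prefetch at the sync op */
    }
}

int main(void)
{
    node_t me = { .valid = { [3] = true, [4] = true } };
    int notices[] = { 3, 4, 9 };       /* page 9 was never valid here */
    naive_prefetch(&me, notices, 3);   /* prefetches pages 3 and 4 only */
    return 0;
}
```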

Although simple and intuitive, the Naive strategy can cause four types of performance problems. As suggested by observation 6, one serious problem is that this technique can generate an enormous number of useless prefetches (prefetches for pages that are invalidated before being used) when the write notices received at synchronization operations are not a good representation of near-future page accesses. A second problem with the Naive technique is that prefetches are all clustered in time and, as mentioned in observation 1, may cause several processors to compete for service at remote nodes. More specifically, this situation can hurt the performance of the remote processor, as its associated network interface (during diff sends and receives) and the processor itself (during diff generation) contend for access to memory (2). The third type of performance problem that can result from the Naive technique is that prefetches are issued even at the start of short critical sections, which might delay the lock release operation, since the processor has to wait for the prefetches to complete before effectively freeing the lock. The fourth potential problem with our Naive strategy is that it issues prefetches in an order that is not necessarily similar to the order in which page faults will be taken.

The OPT1 technique is intended to eliminate the problems mentioned above. The technique still uses write notices to guide diff prefetching, but associates two counters with each page that are used to determine how often prefetches of the page are useful. Prefetches are only issued for pages that experience 50% or more useful prefetches, provided that they are currently valid at the local node. In addition, inspired by observations 3 and 5, this technique spreads prefetches out in time by only issuing prefetches at the time of a page fault outside of a critical section; at each such fault, the diffs for a single page are prefetched (3). The page to prefetch for is determined by dequeuing an element of one of two lists: the list containing the numbers of the pages that experienced faults in the previous phase of execution, and the list that records the faults of the phase before the previous one. The order in which page numbers appear in each of the lists is the same as the sequence of page faults during the phase. The similarity between the second and third phases (section 4.2) determines which of the lists is to be used throughout the rest of the execution: if those phases are similar, the list corresponding to the previous phase is used; otherwise, the list corresponding to the phase before the last is chosen. This strategy attempts to adjust to the common fault behavior of Em3d, where similar phases alternate during execution.

The main difference between the OPT1 and OPT2 techniques is that OPT2 avoids using write notices to guide prefetching altogether, since when the notices are poor descriptions of future faults OPT1 will simply reduce the number of useless prefetches, not increase the number of useful ones. Instead of the write notices, one of the lists of pages faulted on is itself used as a description of future faults. OPT2 does not consider the utility of prefetches either.
(2) In our simulations we assume that, once started, the network interface DMA can complete its transfers without having to compete for resources with the processor.

(3) In our system, the first access to a prefetched page also causes a violation, even if the page is all set to be used. Thus, we also start prefetches on these events when prefetching on faults.
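One reading of the OPT1 description above can be sketched as follows, again with our own names and an assumed prefetch_diffs hook: per-page counters of useful and useless prefetches gate which invalidated pages become prefetch candidates, and a single candidate is prefetched at each page fault taken outside a critical section, in the order given by the chosen fault-history list.

```c
#include <stdbool.h>
#include <stdio.h>

#define NPAGES 64   /* shared pages (illustrative) */

/* Stand-in for the runtime's asynchronous diff-fetch request (assumed hook). */
static void prefetch_diffs(int page) { printf("prefetch diffs for page %d\n", page); }

typedef struct {
    int  useful[NPAGES];    /* past prefetches used before invalidation  */
    int  useless[NPAGES];   /* past prefetches invalidated before use    */
    bool valid[NPAGES];
    bool candidate[NPAGES]; /* marked at synchronization time            */
    const int *history;     /* fault list of the chosen earlier phase    */
    int  hist_len, hist_pos;
    bool in_critical_section;
} opt1_t;

/* At a synchronization point: a write notice only makes a page a prefetch
 * candidate if at least 50% of its past prefetches were useful (pages with
 * no history get the benefit of the doubt -- our choice) and the page was
 * valid at the local node. */
static void opt1_on_write_notice(opt1_t *s, int page)
{
    int total = s->useful[page] + s->useless[page];
    bool worth_it = (total == 0) || (2 * s->useful[page] >= total);
    if (worth_it && s->valid[page])
        s->candidate[page] = true;
    s->valid[page] = false;
}

/* At a page fault outside a critical section: prefetch the diffs of a
 * single page, the next candidate in the recorded fault history. */
static void opt1_on_page_fault(opt1_t *s)
{
    if (s->in_critical_section)
        return;                          /* never prefetch inside crit sects */
    while (s->hist_pos < s->hist_len) {
        int page = s->history[s->hist_pos++];
        if (s->candidate[page]) {
            s->candidate[page] = false;
            prefetch_diffs(page);        /* one page per fault */
            return;
        }
    }
}

int main(void)
{
    static const int hist[] = { 5, 6, 7 };  /* fault order of the chosen phase */
    opt1_t s = { .valid = { [5] = true, [6] = true, [7] = true },
                 .useless = { [6] = 3 },    /* page 6: 0/3 useful -> filtered  */
                 .history = hist, .hist_len = 3 };
    for (int p = 5; p <= 7; p++) opt1_on_write_notice(&s, p);
    opt1_on_page_fault(&s);                 /* prefetches page 5 */
    opt1_on_page_fault(&s);                 /* skips page 6, prefetches page 7 */
    return 0;
}
```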

Inspired by observation 5, the Sequential technique issues diff prefetches for two pages sequentially following or preceding the page faulted on in memory, provided that the fault did not happen inside of a critical section and that the candidate page is not currently valid, has been referenced in the past, and does not have an outstanding or completed prefetch. 32 candidate pages are tested for these characteristics, and if none of them passes the test the system decides not to prefetch on the current fault. In contrast with Naive, OPT1, and OPT2, this technique could also be implemented to prefetch pages a processor has never referenced before. However, we avoid prefetching these pages, since this could lead to a large number of useless prefetches when the pages prefetched by a processor are not at all part of its working set.

As mentioned in observation 2, compiler analysis might be required to prefetch efficiently for certain applications. The Compiler-based strategy is similar to software prefetching for hardware DSMs [2, 9]. In this technique, prefetch calls are inserted in the application code manually to orchestrate page and diff prefetching. Each prefetch call requests prefetching for one or more pages. Before inserting the calls, we traced the page faults to count the number of faults associated with each page. In order to prefetch without generating substantial overhead for specifying the corresponding pages, we inserted prefetch calls only for the pages that experience significant numbers of faults and performed loop unrolling and splitting wherever necessary. Note that the accuracy of this technique can never be matched by a real compiler; the technique is used to evaluate the potential of compiler-based solutions to prefetching.

Table 3 summarizes the main characteristics of the prefetching techniques we propose; the sketch that follows the table illustrates the Sequential candidate test.

Characteristic            | Naive | OPT1 | OPT2 | Sequential | Comp-based
Prefetching at synch ops  |  yes  |      |      |            |
Write notice hints        |  yes  | yes  |      |            |
Adaptive prefetches       |       | yes  |      |            |
Prefetching on faults     |       | yes  | yes  |    yes     |
Sequential prefetches     |       |      |      |    yes     |
No pref in crit sects     |       | yes  | yes  |    yes     |
Static analysis           |       |      |      |            |    yes

Table 3: Main Characteristics of Prefetching Techniques.
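The sketch below spells out the Sequential candidate test, following the description above; the field names and the exact search order around the faulting page are our own choices, not taken from the implementation.

```c
#include <stdbool.h>
#include <stdio.h>

#define NPAGES 64   /* shared pages (illustrative) */

/* Stand-in for the runtime's asynchronous diff-fetch request (assumed hook). */
static void prefetch_diffs(int page) { printf("prefetch diffs for page %d\n", page); }

typedef struct {
    bool valid[NPAGES];
    bool referenced_before[NPAGES];   /* page was in this node's working set */
    bool prefetched[NPAGES];          /* outstanding or completed prefetch   */
    bool in_critical_section;
} seq_t;

static bool good_candidate(const seq_t *s, int page)
{
    return page >= 0 && page < NPAGES &&
           !s->valid[page] &&
           s->referenced_before[page] &&
           !s->prefetched[page];
}

/* Sequential: on a fault outside a critical section, examine up to 32 pages
 * around the faulting one (here alternating above and below it -- our own
 * choice of order) and prefetch diffs for the first two that pass the test.
 * If no candidate passes, nothing is prefetched on this fault. */
static void sequential_on_fault(seq_t *s, int faulted_page)
{
    if (s->in_critical_section)
        return;
    int issued = 0;
    for (int d = 1; d <= 16 && issued < 2; d++) {      /* 32 candidates total */
        int around[2] = { faulted_page + d, faulted_page - d };
        for (int k = 0; k < 2 && issued < 2; k++) {
            if (good_candidate(s, around[k])) {
                s->prefetched[around[k]] = true;
                prefetch_diffs(around[k]);
                issued++;
            }
        }
    }
}

int main(void)
{
    seq_t s = { .referenced_before = { [20] = true, [21] = true, [23] = true } };
    sequential_on_fault(&s, 22);   /* prefetches pages 23 and 21 */
    return 0;
}
```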

5.2 Evaluation of the Techniques


Prefetch utilization. Figure 8 presents the prefetch utilization statistics for each of our applications. A prefetch is considered useful when the page prefetched for is referenced before being invalidated. Each of the bars in the figure shows, from left to right, the percentage of useful and useless prefetches for the Naive, OPT1, OPT2, Sequential, and Compiler-based techniques. All results are normalized to the Naive utilization statistics. The results in the figure show that the Naive technique generates an enormous number of useless prefetches for all applications except Em3d. The technique is particularly ineffective for Radix.

Figure 8: Prefetch Utilization for Naive, OPT1, OPT2, Sequential, and Compiler-based Techniques (from left to right).

The OPT1 and OPT2 techniques indeed reduce the number of useless prefetches in Naive significantly. As we expected, Radix represents a major problem for the OPT1 and OPT2 techniques, since most of its faults are not sharing-related and its fault patterns do not repeat throughout the execution. OPT1 and OPT2 perform roughly the same for all applications except Ocean. For this application, the techniques decide that alternate phases are more similar than consecutive ones based on the initial fault behavior of the application, which turns out to be a poor choice for the later phases of execution. OPT2 suffers much more from this bad decision, since its prefetches are driven by the lists of page faults, while the prefetches in OPT1 are driven by write notices.

The Sequential technique performs well for the FFT, Ocean, and Radix applications, where it issues at least as many useful prefetches as the more aggressive Naive technique. The difference in useful prefetches between Sequential and Naive is particularly large for Radix. As we have seen in section 4, in this application past history is a poor description of future accesses, so both the Sequential and Compiler-based techniques achieve a much greater percentage of useful prefetches than all other techniques. In fact, the Compiler-based technique is the best one overall, since it entails the largest number of useful prefetches, except in the case of Ocean. For this application, the technique does not issue prefetches for all pages that cause access faults, in order to avoid excessive computation overhead.

Execution Time. Figure 9 presents a detailed view of the execution time performance of our applications running on 16 processors. The time categories in the graphs are the same as in figure 2. The leftmost bar in each graph, Base, represents the standard TreadMarks running time, while the other bars represent the Naive, OPT1, OPT2, Sequential, and Compiler-based prefetching techniques, from left to right.

The graphs show that the prefetching techniques we study improve the performance of Em3d, FFT, Ocean, and Radix by as much as 9, 3, 29, and 30%, respectively. All techniques are successful at reducing the data fetch overheads of TreadMarks, except in the case of Radix, where the Compiler-based prefetching strategy is the only one to do so.

Figure 9: Running Time Performance of Em3d, FFT, Radix, and Ocean (clockwise from top left corner).

In general, the Naive and Compiler-based techniques are the most successful in this respect. Naive achieves an excellent running time for Ocean. However, for the other applications the gains in Naive are usually surpassed by much higher IPC and synchronization overheads. Synchronization times increase because prefetching makes short critical sections extremely expensive. IPC times increase as a result of prefetching when nodes guess their future access patterns incorrectly and end up prefetching pages they will not actually use. Each useless prefetch causes node interference (in the form of IPC time) that would not occur in Base. In addition, IPC times can increase as a result of competition between the processor and the network interface for access to memory.

Sequential, OPT1, and OPT2 do not increase IPC and synchronization overheads as much as Naive, but do not deliver consistently good performance either. The Compiler-based strategy delivers excellent performance for Radix, but is not the ideal choice for the other applications.

The main reasons for this result are that the Compiler-based technique suffers the computation overhead of specifying and calling prefetch routines, and that it also generates increased IPC overheads. In summary, we find that for any one technique to be consistently effective, it must be able to reduce data fetch overheads without significantly increasing IPC and synchronization latencies.

6 Related Work
As far as we know, no other study has explicitly addressed the page fault behavior of applications running on software DSMs. Prefetching for software DSMs has received little attention so far [5, 7, 6, 1]. Among other techniques, Dwarkadas et al. [5] studied diff prefetching at lock acquire operations. Their cross-synchronization prefetching strategy can be very precise about future page access patterns, but only sends prefetches to the last lock releaser, which might not have an up-to-date copy of the data prefetched. In other work, Dwarkadas et al. [6] combine compile-time analysis with runtime support for several sophisticated techniques, including diff prefetching. Their prefetching techniques aggregate the diffs of several pages in a single reply message whenever possible, therefore achieving a reduction in the number of messages in addition to the overhead tolerance provided by prefetching. The techniques we considered in this paper do not seek to reduce the number of messages, but simply to tolerate data access overheads.

The Sparks DSM construction library [7] provides a clean interface for recording page fault histories, as used in two of our prefetching techniques. Based on this interface, Keleher describes a technique called prefetch playback that identifies pairs of producer and consumer processors (if any exist) and sends the corresponding updates at barrier events. The content of the updates is determined by histories of faults recorded throughout the previous phase of execution. This strategy relies on page fault patterns repeating across all phases, which we have shown does not always happen. The interface is general enough, however, that more sophisticated prefetch techniques can be implemented without much difficulty.

Our previous work [1] has proposed the use of simple hardware support for aggressively tolerating overheads in software DSMs. In the context of that work we evaluated the Naive prefetching technique both under standard TreadMarks and under a modified version of the system that takes advantage of the extra hardware. Our experiments detected the performance problems of the Naive strategy, but showed that it can profit substantially from our hardware support. All the other techniques studied in this paper should benefit from this support even more than Naive.

7 Conclusions
In this paper we assessed the page fault behavior of parallel applications running on top of software DSMs. Based on several important observations about this behavior, we proposed and evaluated five diff prefetching techniques for the TreadMarks DSM. Simulation results of this system running on a network of workstations showed that our prefetching techniques can deliver performance improvements over standard TreadMarks of up to 30%. However, no technique was consistently

effective for all applications. Nevertheless, we only covered a restricted set of prefetching techniques, so using the behavior information we provide in this paper might lead to new and more profitable techniques. Given the initial results we presented, however, our conclusion is that prefetching techniques for software DSMs should only be consistently profitable with hardware support for alleviating IPC and synchronization overheads.

Acknowledgements
We would like to thank Leonidas Kontothanassis for contributing to our simulation infrastructure and for numerous discussions on topics related to the research presented in this paper. We would also like to thank Cristiana Seidel for comments that helped improve this paper.

References
[1] R. Bianchini, L. Kontothanassis, R. Pinto, M. De Maria, M. Abud, and C. L. Amorim. Hiding Communication Latency and Coherence Overhead in Software DSMs. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, Oct 1996.

[2] D. Callahan, K. Kennedy, and A. Porterfield. Software Prefetching. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 40-52, April 1991.

[3] J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and Performance of Munin. In Proceedings of the 13th Symposium on Operating Systems Principles, October 1991.

[4] D. Culler et al. Parallel Programming in Split-C. In Proceedings of Supercomputing '93, pages 262-273, November 1993.

[5] S. Dwarkadas, A. Cox, H. Lu, and W. Zwaenepoel. Compiler-Directed Selective Update Mechanisms for Software Distributed Shared Memory. Technical Report TR95-253, Department of Computer Science, Rice University, 1995.

[6] S. Dwarkadas, A. Cox, and W. Zwaenepoel. An Integrated Compile-Time/Run-Time Software Distributed Shared Memory System. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, Oct 1996.

[7] P. Keleher. Coherence as an Abstract Type. Technical Report CS-TR-3544, Department of Computer Science, University of Maryland, Oct 1995.

[8] P. Keleher, S. Dwarkadas, A. Cox, and W. Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proceedings of the USENIX Winter '94 Technical Conference, pages 17-21, Jan 1994.

[9] T. Mowry and A. Gupta. Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors. Journal of Parallel and Distributed Computing, 12(2):87-106, June 1991.

[10] J. E. Veenstra and R. J. Fowler. MINT: A Front End for Efficient Simulation of Shared-Memory Multiprocessors. In Proceedings of the 2nd International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 1994.

[11] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24-36, May 1995.

