Abstract—We study the tradeoffs between Many-Core machines like Intel’s Larrabee and Many-Thread machines like Nvidia
and AMD GPGPUs. We define a unified model describing a superposition of the two architectures, and use it to identify
operation zones for which each machine is more suitable. Moreover, we identify an intermediate zone in which both machines
deliver inferior performance. We study the shape of this “performance valley” and provide insights on how it can be avoided.
1. INTRODUCTION
gree of multithreading was studied in the early 90's by Agarwal [1]. In retrospect, Agarwal's analysis can be seen as applicable to our MC region, and it observes a trend similar to the one exhibited in the leftmost area of our curve: with the first few threads, performance soars, but then hits a plateau as caches become less effective. Given (dated) parameter values from the 90's, Agarwal found that as few as two or four threads are sufficient to achieve high processor utilization. Our work takes the level of parallelism much further, to tens of thousands of threads, and observes that the plateau is followed by a valley, and then by another uphill slope (the MT region), which in some cases even exceeds the MC peak.

While the exact shape of the curve depends on numerous parameters, the general phenomenon is almost universal. This illustrates the major challenge that processor designers face today: how to stay away from the valley. Indeed, the challenge for MC designers is to extend the MC area to the right and up, so as to exploit higher levels of parallelism. The challenge for their MT counterparts is to extend the MT zone to the left, so as to be effective even when a high number of independent threads is not available. In Section 3, we discuss how various parameters (of the machine or the workload) impact the shape of the valley, providing insights on how the above challenges may be addressed.

Finally, we note that we focus on workloads that can be parallelized into a large number of independent threads with practically no serial code. It is already well understood that for serial code, MC machines significantly outperform MT ones, and that for applications that alternate between parallel and serial phases, asymmetric machines are favorable [4][6]. Our model instead focuses on workloads that offer a high level of parallelism, where the questions we set out to answer are still open.

2. MT/MC UNIFIED MODEL

In order to study the Many-Cores/Many-Threads tradeoff, we present a unified model describing a superposition of the two architectures. We provide a formula to predict performance (in Giga Operations Per Second, GOPS), given the degree of multithreading and other factors. Our model machine is described by the following parameters:

N_{PE} - Number of PEs (simple, in-order processing elements)
S_\$ - Cache size [Bytes]

(t_{avg} is developed in Equation (3) to account for the probability of finding the data in the cache.) During this stall time, the PE is left unutilized, unless other threads are available to switch in. The number of additional threads needed in order to fill the PE's stall time is t_{avg} \cdot r_m / CPI_{exe}, and hence N_{PE} \cdot (1 + t_{avg} \cdot r_m / CPI_{exe}) threads are needed to fully utilize the entire machine. Processor utilization \eta (0 \le \eta \le 1), i.e., the average utilization of all PEs in the machine, is given by:

\eta = \min\left(1,\ \frac{n_{threads}}{N_{PE} \cdot \left(1 + t_{avg} \cdot \frac{r_m}{CPI_{exe}}\right)}\right)    (1)

where n_{threads} is the degree of multithreading (the number of threads available in the workload). Our model assumes that all threads are independent of each other (as explained in Section 1), and that each thread's context is saved in the cache (or in other dedicated on-chip storage) when threads swap out. The machine can thus support any number of in-flight threads as long as it has enough storage capacity to save their contexts. For simplicity, Equation (1) neglects the thread swap time, though it can easily be factored in. The minimum in Equation (1) captures the fact that after all execution units are saturated, there is no gain in adding more threads to the pool.

Finally, the expected performance is:

Performance\,[GOPS] = N_{PE} \cdot \frac{f}{CPI_{exe}} \cdot \eta    (2)

The system utilization \eta is a function of two variables, n_{threads} and t_{avg}, where t_{avg} is affected by the memory hierarchy (the access times of caches and external memory) and by the behavior of the workload (which determines the cache hit rate). t_{avg} can be approximated as follows:

t_{avg} = P_{hit}(S_\$^{thread}) \cdot t_\$ + \left(1 - P_{hit}(S_\$^{thread})\right) \cdot t_m    (3)

where P_{hit}(S_\$^{thread}) is the hit rate for each thread in the application, assuming it can exploit a cache of size S_\$^{thread}. Given a shared pool of on-chip cache resources, the cache size available to each thread decreases as the number of threads grows. Hence, Equation (3) can be rewritten as:

t_{avg} = P_{hit}\left(\frac{S_\$}{n_{threads}}\right) \cdot t_\$ + \left(1 - P_{hit}\left(\frac{S_\$}{n_{threads}}\right)\right) \cdot t_m    (4)
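The model above is simple enough to execute directly. The following Python fragment is an illustrative sketch of Equations (1), (2), and (4); the function and parameter names are ours, and the hit-rate function is left as a pluggable argument rather than fixed by the model:

```python
# Illustrative sketch of the unified MT/MC model, Equations (1), (2), (4).
# Parameter names mirror the paper's symbols (N_PE, CPI_exe, r_m, t_$, t_m, S_$);
# any hit-rate function phit(cache_bytes) -> [0, 1] can be plugged in.

def t_avg(phit, t_cache, t_mem, s_cache, n_threads):
    """Average memory-access time, Eq. (4): each thread sees S_$ / n_threads."""
    p = phit(s_cache / n_threads)
    return p * t_cache + (1.0 - p) * t_mem

def utilization(n_threads, n_pe, cpi_exe, r_m, tavg):
    """Processor utilization eta, Eq. (1), clamped at full utilization."""
    threads_needed = n_pe * (1.0 + tavg * r_m / cpi_exe)
    return min(1.0, n_threads / threads_needed)

def performance_gops(n_threads, n_pe, freq_ghz, cpi_exe, r_m,
                     phit, t_cache, t_mem, s_cache):
    """Expected performance in GOPS, Eq. (2)."""
    tavg = t_avg(phit, t_cache, t_mem, s_cache, n_threads)
    eta = utilization(n_threads, n_pe, cpi_exe, r_m, tavg)
    return n_pe * (freq_ghz / cpi_exe) * eta
```

With a perfect cache (P_hit = 1) and a zero-cycle cache access time, 1024 PEs at 1 GHz with CPI_exe = 1 reach the full 1024 GOPS as soon as n_threads reaches N_PE, matching the saturation argument behind the minimum in Equation (1).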
3. RESHAPING THE VALLEY

Both MC machines and MT machines strive to stay away from the performance valley, or to flatten it. In this section, we study how different parameters affect the performance plot, via an example design. We first present an example system and an analytical hit rate function chosen to characterize the workload. We then exemplify how workload attributes such as compute intensity (Section 3.1), and hardware attributes such as memory latency (Section 3.2) and cache size (Section 3.3), affect the shape of the performance plot. While in these sections we assume, for the sake of clarity, an unlimited bandwidth to memory, we consider bandwidth to be a principal performance limiter and account for the effect of a limited off-chip bandwidth in Section 3.4.

The example system we use in this section consists of 1024 PEs and a 16MB cache. We assume a frequency of 1GHz, a CPIexe of 1 cycle, and an off-chip memory latency (tm) of 200 cycles. The peak performance of the example machine is 1 Tera OPS (Equation (2) with \eta = 1). In our baseline workload, 1 out of 5 instructions is a memory instruction, i.e., rm = 0.2.

The problem of finding an analytical model for the cache hit rate function has been widely studied ([1][2][9] among others). In this paper, we use the following simple function [5]:

P_{hit}(S_\$) = 1 - \left(\frac{S_\$}{\beta} + 1\right)^{-(\alpha - 1)}    (5)

The function is based upon the well-known empirical power law from the 70's (also known as the 30% rule or the \sqrt{2} rule) [3]. In Equation (5), workload locality increases when increasing α or decreasing β.

[Fig. 3: GOPS vs. number of threads for rising compute/memory ratios — compute/mem = 5, 10, 20, 100, 200; pure compute.]
Fig. 3. Performance for different compute/memory ratios.

3.1 Compute Intensity Impact

Fig. 3 shows that the higher the ratio of computation instructions to memory instructions, the steeper the performance climb. This trend results from the fact that as more instructions are available for each memory access, fewer threads are needed in order to fill the stall time caused by waiting for memory. Moreover, the penalty of accessing memory is amortized by the small portion of memory accesses in the total instruction mix. All in all, a high compute/memory ratio decreases the need for caches, and eliminates the valley.

Applications with more computation per memory access (for example, the light-gray lines) reach peak performance much faster, and can even avoid the valley entirely, since there is no significant memory latency to screen. Moreover, in such applications there is practically no difference between MC and MT, as caches have almost no effect. In other words, a higher compute/memory ratio makes the workload more suitable for MT machines. This trend has been identified by MT machine makers, who suggest aiming for a high compute/memory ratio as a programming guideline [7].
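The trends of this section can be reproduced numerically. The sketch below re-implements Equations (1), (2), (4), and (5) with the example-system parameters; the α and β values, and the zero-cycle cache access time, are our own illustrative choices picked to produce a pronounced valley, not values taken from the figures:

```python
# Performance vs. thread count for the Section 3 example system:
# 1024 PEs, 16 MB shared cache, 1 GHz, CPI_exe = 1, t_m = 200 cycles,
# with the power-law hit rate of Eq. (5). ALPHA/BETA are illustrative.

N_PE = 1024
S_CACHE = 16 * 2**20           # 16 MB, in bytes
FREQ_GHZ = 1.0
CPI_EXE = 1.0
T_MEM = 200.0                  # off-chip latency, cycles
ALPHA, BETA = 8.0, 25 * 1024   # locality parameters (assumed values)

def p_hit(s_bytes):
    """Power-law hit rate, Eq. (5)."""
    return 1.0 - (s_bytes / BETA + 1.0) ** (-(ALPHA - 1.0))

def gops(n_threads, r_m):
    """Eqs. (1), (2), (4), with the cache access time taken as zero cycles."""
    t_avg = (1.0 - p_hit(S_CACHE / n_threads)) * T_MEM
    eta = min(1.0, n_threads / (N_PE * (1.0 + t_avg * r_m / CPI_EXE)))
    return N_PE * (FREQ_GHZ / CPI_EXE) * eta

if __name__ == "__main__":
    for r_m in (0.2, 0.05):    # 1-in-5 vs. 1-in-20 memory instructions
        curve = [(n, gops(n, r_m)) for n in range(512, 40961, 512)]
        mc_peak = max(g for n, g in curve if n <= 2048)
        valley = min(g for n, g in curve if n >= 1024)
        print(f"r_m = {r_m}: MC-side peak {mc_peak:.0f} GOPS, "
              f"valley {valley:.0f} GOPS, far right {curve[-1][1]:.0f} GOPS")
```

With these (assumed) locality parameters and r_m = 0.2, performance first climbs, dips as per-thread cache shrinks, and recovers only at tens of thousands of threads: the valley of Section 1.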
[Figure: GOPS vs. number of threads for rising workload locality — infinite cache; α=8.0, β=25; α=7.0, β=30; α=7.0, β=50; α=6.0, β=100; no cache.]

3.2 Memory Latency Impact

We next examine how the latency to off-chip memory (given as the number of cycles needed to access the data, i.e., tm) affects the shape of the performance plot. Fig. 4 presents the performance plot for several different off-chip memory latencies.

[Fig. 4: GOPS vs. number of threads for several off-chip memory latencies — zero latency; 50 cycles; …]
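Equation (1) makes the latency effect quantitative: the thread count needed for full utilization grows linearly with tm. A minimal sketch (ours, not from the paper; the fixed 50% hit rate is an assumption used to isolate the latency term) for the example system:

```python
# Thread count needed to hide off-chip latency: from Eq. (1), full utilization
# requires n_threads >= N_PE * (1 + t_avg * r_m / CPI_exe), and t_avg scales
# with t_m. Example-system values; the fixed hit rate is an assumption.

N_PE = 1024
CPI_EXE = 1.0
R_M = 0.2

def threads_for_full_utilization(t_mem, p_hit, t_cache=0.0):
    """Smallest n_threads giving eta = 1 in Eq. (1), for a fixed hit rate."""
    t_avg = p_hit * t_cache + (1.0 - p_hit) * t_mem
    return N_PE * (1.0 + t_avg * R_M / CPI_EXE)

for t_mem in (50.0, 100.0, 200.0):
    n = threads_for_full_utilization(t_mem, p_hit=0.5)
    print(f"t_m = {t_mem:.0f} cycles -> ~{n:.0f} threads to reach eta = 1")
```

Halving tm halves the stall term in Equation (1), which is consistent with lower-latency curves needing far fewer threads to climb out of the valley.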
performance graph. Fig. 5 presents several performance plots differing in their cache size.

[Fig. 5: GOPS vs. number of threads for rising cache capacity — no cache; 4 MB; 8 MB; 16 MB; 32 MB; 64 MB; 100 MB; infinite cache.]
Fig. 5. Performance for different cache sizes.

As can be seen in Fig. 5, cache size is a crucial factor for MC machines, as larger caches extend the MC area up and to the right. This is because larger caches suffice for more in-flight threads, thus enabling the curve to stay longer with the "perfect cache" scenario.

3.4 Off-Chip Bandwidth Impact

In the previous sections, we assumed an unlimited bandwidth to external memories. Alas, off-chip bandwidth is a principal bottleneck and may limit performance, as we show in this section. In order to take the role of off-chip bandwidth into account, we introduce the following new notations:

b_{reg} - Size of operands [Bytes]
BW_{max} - Maximal off-chip bandwidth [GB/sec]

The latter is an attribute of the on-die physical channel. Using the above notations, off-chip bandwidth can be expressed as:

BW = Performance \cdot r_m \cdot b_{reg} \cdot (1 - P_{hit}),    (6)

where performance is given in GOPS.

Given a concrete chip with a peak off-chip bandwidth limit (BW_{max}), the maximal performance achievable by the machine is BW_{max} / (r_m \cdot b_{reg} \cdot (1 - P_{hit})). Fig. 6 presents the performance plot with several different bandwidth limits, assuming all operands are 4 bytes long (i.e., single-precision calculations).

[Fig. 6: GOPS vs. number of threads for rising off-chip BW capacity — 50, 100, 150, 200, 250, 300 GB/sec; unlimited BW.]
Fig. 6. Performance with limited off-chip bandwidth.

At the rightmost side of the plot, where all accesses to data are served from memory, performance converges to BW_{max} / (r_m \cdot b_{reg}). However, some of the plots exhibit a reduction in performance after reaching a peak in the MT region. To explain why this happens, recall that P_{hit} is affected by the number of threads in the system, because the more in-flight threads there are, the less cache is available to each one of them. Therefore, once the off-chip bandwidth wall is met, adding more threads only degrades performance, due to increasing off-chip pressure. The bandwidth wall in our example causes some of the plots to never reach the point where the MT area exceeds the performance of the MC area. This is because the BW wall limits the performance of the MT region but not that of the MC region, where caches can dramatically reduce the pressure on off-chip memories. In other words, MT machines, relying on very high thread counts to be effective, dramatically increase the off-chip bandwidth pressure, and hence need very aggressive memory channels to deliver performance (e.g., up to 150GB/sec in Nvidia's current GPUs). MC machines, on the other hand, can make do with relatively slimmer channels, as their caches screen out most of the data accesses (e.g., 20-30GB/sec in Intel's Nehalem processor).

4. SUMMARY

We studied the performance of a hybrid Many-Core (MC) and Many-Thread (MT) machine as a function of the number of concurrent threads available. We found that while both MC and MT machines may shine when given a suitable workload (in number of threads), both suffer from a "performance valley", where they perform poorly compared to their achievable peaks. We studied how several key characteristics of both the workload and the hardware impact performance, and presented insights on how processor designers can stay away from the valley.

ACKNOWLEDGMENT

We thank Ronny Ronen, Michael Behar, and Roni Rosner. This work was partially supported by the Semiconductor Research Corporation (SRC), Intel, and the Israeli Ministry of Science Knowledge Center on Chip MultiProcessors.

REFERENCES

[1] A. Agarwal, "Performance Tradeoffs in Multithreaded Processors," IEEE Trans. on Parallel and Distributed Systems, 1992.
[2] A. Agarwal, J. Hennessy, and M. Horowitz, "An Analytical Cache Model," ACM Trans. on Computer Systems, May 1989.
[3] C. K. Chow, "Determination of Cache's Capacity and its Matching Storage Hierarchy," IEEE Trans. Computers, vol. C-25, 1976.
[4] M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era," IEEE Computer, vol. 41, July 2008.
[5] B. L. Jacob, P. M. Chen, S. R. Silverman, and T. N. Mudge, "An Analytical Model for Designing Memory Hierarchies," IEEE Trans. Computers, vol. 45, no. 10, October 1996.
[6] T. Y. Morad, U. C. Weiser, A. Kolodny, M. Valero, and E. Ayguadé, "Performance, Power Efficiency, and Scalability of Asymmetric Cluster Chip Multiprocessors," Computer Architecture Letters, vol. 4, July 2005.
[7] NVIDIA, "CUDA Programming Guide 2.0," June 2008.
[8] L. Seiler et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing," SIGGRAPH 2008.
[9] D. Thiebaut and H. S. Stone, "Footprints in the Cache," ACM Trans. on Computer Systems (TOCS), Nov. 1987.