Professional Documents
Culture Documents
Instr
D1
Instr D2
3e+09 Branch
D1
D2 2e+09 Sum
Branch Actual
Sum
cycles
cycles
2e+09 Actual
1e+09
1e+09
0 0
0 50 100 150 200 250 300 350 400 450 500 550 0 50 100 150 200 250 300 350 400 450 500 550
nursery size (KB) nursery size (KB)
compress compress2
6e+09 5e+09
Instr
5e+09 D1 Instr
D2 4e+09
D1
Branch D2
4e+09 Sum Branch
Actual Sum
3e+09
Actual
cycles
cycles
3e+09
2e+09
2e+09
1e+09
1e+09
0 0
0 50 100 150 200 250 300 350 400 450 500 550 0 50 100 150 200 250 300 350 400 450 500 550
nursery size (KB) nursery size (KB)
fulsom gamteb
3e+09 5e+09
Instr 4e+09
Instr
D1 D1
2e+09 D2 D2
Branch Branch
Sum 3e+09
Sum
cycles
cycles
Actual Actual
2e+09
1e+09
1e+09
0 0
0 50 100 150 200 250 300 350 400 450 500 550 0 50 100 150 200 250 300 350 400 450 500 550
nursery size (KB) nursery size (KB)
cycles
3e+09
2e+09
2e+09
1e+09
1e+09
0 0
0 50 100 150 200 250 300 350 400 450 500 550 0 50 100 150 200 250 300 350 400 450 500 550
nursery size (KB) nursery size (KB)
infer parser
6e+09 6e+09
5e+09 5e+09
cycles
D2 Branch
3e+09 Branch 3e+09 Sum
Sum Actual
Actual
2e+09 2e+09
1e+09 1e+09
0 0
0 50 100 150 200 250 300 350 400 450 500 550 0 50 100 150 200 250 300 350 400 450 500 550
nursery size (KB) nursery size (KB)
rsa symalg
4e+09 5e+09
Instr
4e+09 D1
3e+09 D2
Branch
Sum
3e+09 Actual
cycles
cycles
2e+09 Instr
D1
D2 2e+09
Branch
Sum
1e+09 Actual
1e+09
0 0
0 50 100 150 200 250 300 350 400 450 500 550 0 50 100 150 200 250 300 350 400 450 500 550
nursery size (KB) nursery size (KB)
ghc ghc-O
1e+10 2.4e+10
9e+09 2.2e+10
2e+10
8e+09
Instr 1.8e+10
7e+09 D1
D2 1.6e+10 Instr
Branch D1
6e+09 1.4e+10 D2
Sum
cycles
cycles
Actual Branch
5e+09 1.2e+10 Sum
1e+10 Actual
4e+09
8e+09
3e+09
6e+09
2e+09
4e+09
1e+09 2e+09
0 0
0 50 100 150 200 250 300 350 400 450 500 550 0 50 100 150 200 250 300 350 400 450 500 550
nursery size (KB) nursery size (KB)
1e+09
SML Program Des
ription
ount-graphs Graph manipulation
logi
Simple Prolog-like interpreter
ray Ray tra
er
0 simple Spheri
al
uid dynami
s simulation
0 10 20 30 40 50 60 70 tsp Travelling salesperson
initial heap size (MB)
C Program
hpg
g
C
ompiler, v3.0.4
7e+09
gzip LZ77
ompression, v1.3
6e+09 latex LATEX2"/TEXv3.14159 typesetter
perlparse Assembly
ode parser (in Perl v5.6.0)
5e+09 vpr FPGA pla
ement and routing tool
4e+09 Instr
Figure 7: SML and C program des
riptions
cycles
D1
3e+09 D2
Branch
Sum
Actual
The results for all programs are given in Figure 8, whi
h
ontains a key explaining the
olumns. The exe
ution gures
2e+09
1e+09 (
olumns 4{5, 9{12) were obtained with the hardware
oun-
ters and the garbage
olle
tor gures (
olumns 6{8) with
0
0 10 20 30 40 50 60 70
the -s runtime option.
initial heap size (MB) The immediate
on
lusions are that GHC programs and
compress2
SML/NJ programs are memory intensive, with higher in-
5e+09
stru
tion / memory a
ess ratios than the C programs. This
Instr
D1
is be
ause they both use
opying garbage
olle
tion, and al-
D2 lo
ate memory furiously (58{287MB per se
ond of mutator
4e+09 Branch
Sum time for the Haskell programs, even higher for SML/NJ).
Actual
But the CPI gures are most
ompelling|the GHC pro-
3e+09 grams range from 1.0{4.3, with most in the 1.5{3.0 range,
cycles
5e+09
5.1 The Model
Instr The model takes into a
ount only the four largest
om-
4e+09
D1
D2 ponents of exe
ution time: instru
tion exe
ution, stalls due
Branch
Sum
to L1 and L2
a
he data misses, and stalls due to bran
h
mispredi
tions. The model is:
cycles
3e+09 Actual
2e+09
y
les = 0:8I + 12C1 + 206C2 + 10B
where I is the number of instru
tions exe
uted, C1 and C2
1e+09
are the number of D1
a
he misses and L2
a
he data misses
0
respe
tively, and B is the number of mispredi
ted bran
hes.
5
0 10 20 30 40 50 60 70
The program perlparse is written in Perl, but the Perl
initial heap size (MB)
interpreter is written in C.
6
Two of the C programs are used in the SPEC ben
hmarks;
Figure 6: Ee
ts of varying initial heap size the inputs we used were not from the SPEC ben
hmarks, but
hosen to give similar running times to the Haskell programs.
GHC program Nurs. Heap0 Instr Mem GC Allo
'd Copied L2 Sys BrMis CPI
anna 96KB 0MB 1598M 86% 23% 133MB 35MB 7.9 2.0 58 2.1
a
heprof 160KB 0MB 1484M 72% 33% 222MB 33MB 5.7 1.2 30 1.4
ompress 160KB 0MB 3301M 83% 30% 532MB 65MB 5.2 0.8 24 1.2
ompress2 96KB 32MB 972M 73% 46% 169MB 42MB 10.1 8.4 20 2.9
fulsom 288KB 28MB 717M 82% 28% 183MB 10MB 9.7 6.9 40 2.9
gamteb 128KB 0MB 2301M 73% 17% 350MB 25MB 4.1 0.9 27 1.4
hidden 128KB 0MB 1857M 85% 3% 535MB 2MB 5.8 0.4 48 1.5
hpg 96KB 0MB 2018M 69% 19% 533MB 23MB 7.5 0.4 23 1.2
infer 512KB 32MB 1523M 90% 29% 109MB 35MB 15.7 4.0 59 2.6
parser 288KB 32MB 1517M 75% 34% 287MB 41MB 7.2 5.9 38 2.6
rsa 128KB 0MB 2853M 52% 6% 188MB 3MB 1.3 0.1 9 1.0
symalg 192KB 0MB 1013M 49% 1% 410MB 1MB 6.0 0.3 2 4.3
gh
448KB 32MB 2232M 78% 17% 480MB 32MB 9.8 6.7 46 3.1
gh
-O 320KB 32MB 4580M 79% 23% 922MB 98MB 11.5 7.7 49 3.4
SML/NJ program
ount-graphs 256KB 512KB 9598M 78% ? 3122MB 14MB 6.0 0.3 10 1.2
logi
256KB 512KB 1528M 80% ? 534MB 14MB 11.6 1.2 14 1.6
ray 256KB 512KB 1003M 64% ? 540MB 347KB 13.2 0.9 12 1.5
simple 256KB 512KB 920M 75% ? 269MB 157KB 8.4 1.1 9 1.3
tsp 256KB 512KB 1943M 59% ? 25MB 3MB 8.5 1.1 7 1.5
C program
g
{ { 2250M 59% { { { 3.6 1.6 21 1.5
gzip { { 3189M 46% { { { 13.1 0.3 10 1.1
latex { { 1741M 69% { { { 2.7 0.4 17 1.2
perlparse { { 1778M 67% { { { 3.7 0.1 27 1.4
vpr { { 1958M 70% { { { 5.3 0.2 14 1.3
Nurs: Nursery size (KB)
Heap0 : Initial heap size (MB)
Instr: Retired instru
tions (Athlon event 0x
0)
Mem: Data
a
he a
esses / Retired instru
tions (0x40 / 0x
0)
GC: Garbage
olle
tor time %
Allo
'd: Megabytes allo
ated on heap
Copied: Megabytes
opied during garbage
olle
tion
L2: Data
a
he rells from L2 / Retired instru
tions 1000 (0x42 / 0x
0 1000)
Sys: Data
a
he rells from system / Retired instru
tions 1000 (0x43 / 0x
0 1000)
BrMis: Retired bran
hes mispredi
ted / Retired instru
tions 1000 (0x
5 / 0x
0 1000)
CPI: Cy
les / instru
tion retired (
y
les / 0x
0)
Figure 8: Program
hara
teristi
s
The
onstants were
hosen for the following reasons: 12 and tor (e.g. fulsom). For
ompress the model be
omes a little
206 for the
a
he misses be
ause they are the worst-
ase less a
urate as the nursery size in
rease; for hidden it be-
numbers reported by Calibrator; 10 for bran
h mispredi
-
omes more a
urate. The graphs in Figure 6 also show an
tions be
ause page 208 of [1℄ says \In the event of a mis- impressive mat
h between \Sum" and \A
tual" values.
predi
t, the minimum penalty is ten
y
les"; and 0.8 for Only for symalg in Figure 5 are the predi
tions badly
unstalled instru
tions quite arbitrarily (a three-way ma
hine wrong. This is be
ause 8.3% of its instru
tions are div
with multiple fun
tional units
ould retire unstalled instru
- instru
tions|from the GNU multi-pre
ision library whi
h
tions faster than this). GHC uses to implement innite pre
ision integers|whi
h
take 42
y
les on the Athlon ([1℄ p. 270). If we assume
5.2 Justification 0.8
y
les per unstalled instru
tion for the remaining 91.7%,
One would expe
t that this model is far too simple for the the average
y
les per unstalled instru
tion jumps to 4.3; if
Athlon, an aggressive, out-of-order, three-way supers
alar we took this into a
ount the \Sum" line would be mu
h
pro
essor with nine fun
tional units. And yet, it is surpris-
loser to the \A
tual" line, although its shape would still be
ingly a
urate. Returning to the graphs in Figures 4{6, the wrong.
four
omponents are shown as the \Instr", \D1", \D2" and What
an we dedu
e from this surprising a
ura
y?
\Bran
h" lines. The \Sum" line is their sum. 1. The L2
a
he data stall times are believable. If the
Consider the results from varying nursery size in Figures 4 penalty of 206
y
les was an overestimate or under-
and 5. For eleven of the fourteen programs, the model gives estimate, the shape of the \Sum" and \A
tual" lines
a
orre
tly shaped graph that is almost spot on the real would not mat
h so well for the programs in whi
h in
value (e.g. hpg) or underestimates by a small
onstant fa
- the number of L2 misses
hanges a lot while the other
omponents do not
hange very mu
h (e.g. gamteb, anna 20 28 52
hpg, rsa). Little if any useful work is being done dur- cacheprof 17 20 63
ing L2 data miss stalls. compress 12 20 68
y les per unstalled instru tion, else the shape of the fulsom 49 14 37
ure 6). This omponent overs instru tion exe ution infer 32 23 45
symalg 99
3. The bran
h mispredi
tion times should be an a
u- ghc 45 15 41
rate lower bound, assuming ten
y
les is the minimum
mispredi
tion penalty. This
omponent may be under-
ghc-O 47 15 38
cacheprof 3 13 2 16 6 2 5 34 19 1
compress 4 24 10 2 11 42 15 1
compress2 2 17 24 3 4 6 34 8
fulsom 1 26 2 2 2 61 3 1
gamteb 1 4 1 21 3 5 5 32 11 14 3
hidden 2 18 3 5 3 8 59 2 11
hpg 2 10 9 4 2 23 30 17 3 1
infer 6 13 23 4 1 28 18 6
parser 5 27 3 3 1 53 7
rsa 111 4 1 2 3 23 5 30 30
symalg 11 11 1 2 11 2 92
ghc 2 10 3 13 2 2 6 55 4 3
ghc-O 2 13 2 19 2 2 6 46 5 2
mov (r) jmp1 (r) jmp2 (r) evac (r) copy (r) scav (r) Other (r) mov (w) copy (w) Other (w) Unann.
Program GC Prog Both Theory Ratio write-no-allo
ate
a
hes, be
ause most data is referen
ed al-
anna 3% 0% 4% 5% 0.8 most immediately after allo
ation.
a
heprof 3% 1% 5% 9% 0.6 Diwan, Tarditi and Moss made very detailed simulation of
ompress 2% 2% 5% 7% 0.7 Standard ML programs, in
luding the ee
ts of parts of the
ompress2 3% 9% 12% 25% 0.5 memory system usually ignored su
h as the write buer and
fulsom 1% 17% 17% 31% 0.6 TLB [6℄. They
ompared dierent write-miss strategies, and
gamteb 0% -0% 0% 6% 0.0 found that using sub-blo
k pla
ement
ut
a
he miss rates
hidden 1% 3% 2% 3% 0.7 signi
antly.
hpg 4% 1% 4% 3% 1.3 Gon
alves and Appel also made detailed measurements
infer 3% 8% 9% 8% 1.1 of Standard ML programs [8℄. They found the miss rates of
parser 1% 19% 22% 28% 0.8 SML/NJ programs
ould be lower than SPEC92 C and For-
rsa 3% -1% 2% < 1% tran programs. Ne
ula and George also measured SML/NJ
symalg 1% 1% 1% < 1% programs [14℄, on a DEC Alpha with performan
e
oun-
gh
1% 17% 17% 27% 0.6 ters. They found that stalls
aused by data
a
he misses
gh
-O 2% 12% 14% 24% 0.6 a
ounted for 24% of exe
ution time.
The main dieren
es between this work and previous work
Figure 12: Ee
t of prefet
hing is that we have
onsidered large programs in a lazy language,
we have identied
a
he misses down to the level of individ-
ual instru
tions, we have used prefet
hing to avoid
a
he
theoreti
al and a
tual speed-ups. This is pleasing sin
e more misses, and we have also
onsidered bran
h mispredi
tions
prefet
hes are performed than ne
essary (one per
losure al- and instru
tion-level parallelism.
lo
ated/
opied, about six per
a
heline), and prefet
hing
in
reases memory bandwidth requirements. 8. FURTHER WORK AND CONCLUSION