Here, we focus on:

- Using available technology to improve computer performance
- Using quantitative measures to test architectural ideas
- Using a RISC instruction set for examples
- Discussing a variety of software and hardware techniques to provide optimization
- Attempting to force as much parallelism out of the code as possible
Measuring Performance
We might use one of the following terms to measure performance:

- MIPS, MegaFLOPS
  - neither of these terms tells us how the processor performs on the other type of operation
- Execution time
  - worthwhile on an unloaded system
- Throughput
  - number of programs / unit time; useful for servers and large systems
- Wall-clock time, CPU time, user CPU time, system CPU time
  - CPU time = user CPU time + system CPU time
- System performance
  - on an unloaded system
Meaning of Performance
Example:
if the throughput of X is 1.3 times higher than that of Y, then X can execute 1.3 times as many tasks as Y in the same amount of time
Example:
- X executes program P1 in 0.32 seconds
- Y executes program P1 in 0.38 seconds
- X is 0.38 / 0.32 = 1.19 times, i.e., 19%, faster
To validly compare two computers' performance, we must compare their performance on the same program. Additionally, computers may perform better on different programs

- e.g., C1 runs P1 faster than C2, but C2 runs P2 faster than C1
- we might use weighted averages or geometric means, as well as distributions, to derive a single processor's overall performance (see pages 34-37 if you are interested)
Benchmarks
Four levels of programs can be used to test performance:

- Real programs
  - e.g., a C compiler or CAD tool; programs with input, output, and options that the user selects
- Kernels
  - remove key pieces of real programs and test just those
- Toy benchmarks
  - 10-100 lines of code, such as quicksort, whose performance is known in advance
- Synthetic benchmarks
  - try to match the average frequency of operations to simulate larger programs
Amdahl's Law
In order to explore architectural improvements, we need a mechanism to gage the speedup of our improvements. Amdahl's Law allows us to compute the speedup that can be gained by using a particular feature.

Given an enhancement E:

Speedup = performance with E / performance without E
or
Speedup = execution time without E / execution time with E

This law uses two factors:

- the fraction of the computation time in the original machine that can be converted to take advantage of the enhancement (F)
- the improvement gained by the enhanced execution mode, i.e., how much faster will the task run if the enhanced mode is used for the entire program? (S)
Speedup = 1 / [(1 - F) + F / S]
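Amdahl's Law is easy to check numerically; a minimal sketch (the function name `amdahl_speedup` is mine):

```python
def amdahl_speedup(f, s):
    """Overall speedup when fraction f of execution time is
    enhanced to run s times faster (Amdahl's Law)."""
    return 1.0 / ((1.0 - f) + f / s)

# e.g., speeding up 40% of the work by 10x yields only ~1.56x overall
print(amdahl_speedup(0.4, 10))
```

Note how quickly the unenhanced fraction (1 - F) dominates: even an infinite S cannot push the speedup past 1 / (1 - F).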
Example:
A benchmark's execution time consists of:

- 20% FP sqrt
- 50% FP operations (including sqrt)
- 50% other operations

Compare speeding up FP sqrt by a factor of 10 versus speeding up all FP operations by a factor of 1.6:

Speedup FP sqrt = 1 / [(1 - .2) + .2 / 10] = 1.22
Speedup all FP = 1 / [(1 - .5) + .5 / 1.6] = 1.23

The enhancement to support the common case is (slightly) better.
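Plugging the numbers in (a sketch; the 10x and 1.6x enhancement factors are the ones used in the formulas above):

```python
def amdahl_speedup(f, s):
    # Amdahl's Law: fraction f of time enhanced by factor s
    return 1.0 / ((1.0 - f) + f / s)

# Speed up only FP sqrt (20% of execution time) by 10x
sqrt_only = amdahl_speedup(0.20, 10)
# Speed up all FP operations (50% of execution time) by 1.6x
all_fp = amdahl_speedup(0.50, 1.6)
print(sqrt_only, all_fp)   # the common case wins, slightly
```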
CPU time = CPU clock cycles for program / clock rate
CPU time = IC * CPI * clock cycle time

- IC - instruction count (number of instructions)
- CPI - clock cycles per instruction
- IC * CPI = CPU clock cycles

CPI = CPU clock cycles / IC
CPU time = (Σ CPIi * ICi) * clock cycle time
Average CPI = Σ (CPIi * ICi) / total instruction count

In the latter equations, CPIi and ICi are given for each type of operation (for instance, the CPI and number of adds, the CPI and number of loads, ...)
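The weighted-average CPI and the CPU time equation can be sketched directly (function names and the example mix are mine):

```python
def average_cpi(mix):
    """mix: list of (instruction_fraction, cpi) pairs, per operation type;
    the fractions should sum to 1."""
    return sum(frac * cpi for frac, cpi in mix)

def cpu_time(ic, cpi, cycle_time_s):
    """CPU time = instruction count * clocks per instruction * cycle time."""
    return ic * cpi * cycle_time_s

# hypothetical mix: 60% ALU ops at CPI 1, 40% memory ops at CPI 2
cpi = average_cpi([(0.6, 1), (0.4, 2)])
print(cpi)                                 # ≈ 1.4
print(cpu_time(1_000_000, cpi, 2e-9))      # seconds for a million instructions
```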
Example
Assume:
- frequency of FP operations (other than sqrt) = 25%; frequency of FP sqrt = 2%
- average CPI of FP operations = 4.0; CPI of FP sqrt = 20
- average CPI of other instructions = 1.33

CPI = 4 * 25% + 1.33 * 75% = 2.0
Two alternatives:
- reduce the CPI of FP sqrt to 2, or
- reduce the average CPI of all FP ops (including sqrt) to 2.5

CPI new FP sqrt = CPI original - 2% * (20 - 2) = 1.64
CPI new FP = 75% * 1.33 + 25% * 2.5 = 1.625
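A quick check of the two alternatives (a sketch; the exact arithmetic gives a baseline of 1.9975 rather than the rounded 2.0, so the derived values are a hair below the slide's):

```python
# Baseline: 25% FP ops at average CPI 4.0, 75% other at CPI 1.33
cpi_orig = 4.0 * 0.25 + 1.33 * 0.75      # ≈ 2.0

# Alternative 1: drop FP sqrt CPI from 20 to 2 (sqrt is 2% of instructions)
cpi_alt1 = cpi_orig - 0.02 * (20 - 2)    # ≈ 1.64

# Alternative 2: drop the average CPI of all FP ops to 2.5
cpi_alt2 = 0.75 * 1.33 + 0.25 * 2.5      # ≈ 1.625

# Lower CPI wins at fixed IC and clock: alternative 2 is slightly better
print(cpi_alt1, cpi_alt2)
```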
A benchmark consists of 35% loads, 15% stores, 40% ALU operations, and 10% branches

- the CPI is 5 for loads and stores and 4 for ALU operations and branches
- since this is an integer benchmark, the floating point registers are not used
- we could keep more values in registers by moving them to floating point registers rather than storing them to memory and later reloading them

Let's have the compiler replace some of the loads/stores with register moves

- this enhancement is done by the compiler, so it costs us nothing!
- assuming that the compiler can remove 20% of the loads from the program, how worthwhile is it?
Solution
We change some loads/stores to ALU operations
so overall CPI goes down, IC remains the same
Ignoring system issues, if the machine has a 2 nanosecond clock cycle (500 MHz) and an unoptimized CPI of 1.57, what is the MIPS rating for the optimized and unoptimized versions? Does the MIPS value agree with the execution time?

- So, MIPS and execution time are not directly related!
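The MIPS rating falls directly out of clock rate and CPI; a sketch for the unoptimized version (the optimized rating follows the same formula once its CPI is known):

```python
def mips_rating(clock_hz, cpi):
    """Native MIPS = (instructions completed per second) / 10**6."""
    return clock_hz / cpi / 1e6

# 500 MHz machine with an unoptimized CPI of 1.57
print(mips_rating(500e6, 1.57))   # ≈ 318 MIPS
```

A higher MIPS value need not mean a shorter run time: an optimization that removes instructions but raises CPI can lower the MIPS rating while still finishing sooner.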
Sample Problem #1
Consider adding register-memory ALU instructions to a machine that previously permitted only register-register ALU operations. Assume a benchmark with the following breakdown of operations is used to test this enhancement:

- ALU operations: 43%, CPI = 1
- Loads: 21%, CPI = 2
- Stores: 12%, CPI = 2
- Branches: 24%, CPI = 2

But 25% of the ALU operations directly use a loaded value that is used only once, so the new register-memory ALU instruction can be used in place of the load + ALU pair. Is it worth it?
Solution
CPIold = .43 * 1 + .57 * 2 = 1.57
3 changes:

- some ALU operations use the new mode, which changes their CPI
- fewer loads
- all branches have a higher CPI (3 instead of 2)

Since 25% * 43% = 10.75% of the instructions (the replaced loads) disappear, IC in the new system is about 89% of the old system. The new instruction mix is:

- Loads: [21% - (25% * 43%)] / 89% = 11%
- Stores: 12% / 89% = 13%
- ALU operations: 43% / 89% = 48%
- Branches: 24% / 89% = 27%

CPInew = .11 * 2 + .13 * 2 + .27 * 3 + .48 * (.25 * 2 + .75 * 1) = 1.89

CPU time = IC * CPI * clock cycle time

- the clock cycle time remains unchanged, CPI has been recomputed, and IC in the new system is 89% of the old system

CPUold = IC * 1.57 * clock cycle time
CPUnew = .89 * IC * 1.89 * clock cycle time
Speedup = 1.57 / (.89 * 1.89) = .934

- this is a slowdown, so this enhancement is not an improvement!
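The whole calculation can be replayed in a few lines (a sketch; using the unrounded mix gives CPInew ≈ 1.91 and speedup ≈ 0.92, close to the slides' rounded 1.89 and 0.934, and still a slowdown):

```python
# Old machine: register-register only
cpi_old = 0.43 * 1 + (0.21 + 0.12 + 0.24) * 2   # 1.57

removed = 0.25 * 0.43        # loads folded into new reg-mem ALU ops
ic_ratio = 1.0 - removed     # new IC ≈ 89% of old

# renormalized instruction mix
loads    = (0.21 - removed) / ic_ratio
stores   = 0.12 / ic_ratio
alu      = 0.43 / ic_ratio   # 25% of these are reg-mem (CPI 2), rest CPI 1
branches = 0.24 / ic_ratio   # branch CPI rises to 3

cpi_new = loads * 2 + stores * 2 + branches * 3 + alu * (0.25 * 2 + 0.75 * 1)
speedup = cpi_old / (ic_ratio * cpi_new)
print(speedup)               # < 1, i.e., the "enhancement" is a slowdown
```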
Sample Problem #2
Assume a machine with a perfect cache and the following instruction mix breakdown:

- ALU: 43%, CPI 1
- Loads: 21%, CPI 2
- Stores: 12%, CPI 2
- Branches: 24%, CPI 2

Now assume an imperfect cache with a miss rate of 5% for instructions and 10% for data, and a miss penalty of 40 cycles:

CPI imperfect cache = .43 * (1 + .05 * 40) + .21 * (2 + .05 * 40 + .10 * 40) + .12 * (2 + .05 * 40 + .10 * 40) + .24 * (2 + .05 * 40) = 4.89
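The same sum, spelled out with the stall terms named (a sketch of the calculation above):

```python
PENALTY = 40      # miss penalty in cycles
I_MISS  = 0.05    # instruction-fetch miss rate (every instruction fetches)
D_MISS  = 0.10    # data miss rate (loads and stores only)

i_stall = I_MISS * PENALTY    # 2 cycles added to every instruction
d_stall = D_MISS * PENALTY    # 4 cycles added to each load and store

cpi = (0.43 * (1 + i_stall)
       + 0.21 * (2 + i_stall + d_stall)
       + 0.12 * (2 + i_stall + d_stall)
       + 0.24 * (2 + i_stall))
print(cpi)   # ≈ 4.89, versus 1.57 with a perfect cache
```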
Sample Problem #3
Architects are considering one of two enhancements for their processor
- #1 can be used 20% of the time and offers a speedup of 3
- #2 offers a speedup of 7
What fraction of the time will the second enhancement have to be used in order to achieve the same overall speedup as the first enhancement?
speedup from #1 = 1 / [(1 - .2) + .2 / 3] = 1.154
So, for the second enhancement to match, we have 1.154 = 1 / [(1 - x) + x / 7] and we must solve for x

using some algebra, we get: 1.154 = 1 / (1 - 7x/7 + x/7) = 1 / (1 - 6x/7) = 1 / [(7 - 6x) / 7] = 7 / (7 - 6x), so 7 - 6x = 7 / 1.154, giving 6x = 7 - 7 / 1.154 = 0.934, or x = 0.934 / 6 = 0.156
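The closed-form answer is easy to verify numerically (a sketch):

```python
def amdahl_speedup(f, s):
    # Amdahl's Law: fraction f of time enhanced by factor s
    return 1.0 / ((1.0 - f) + f / s)

target = amdahl_speedup(0.2, 3)    # speedup of enhancement #1, ≈ 1.154
# From 1 / (1 - 6x/7) = target:  x = 7 * (1 - 1/target) / 6
x = 7 * (1 - 1 / target) / 6
print(x)                           # ≈ 0.156
# sanity check: using #2 a fraction x of the time matches enhancement #1
assert abs(amdahl_speedup(x, 7) - target) < 1e-12
```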
Sample Problem #4
We will compare a CISC machine and a RISC machine on a benchmark
The machines have the following characteristics
CISC machine:

- CPIs of 4 for load/store, 3 for ALU/branch, and 10 for call/return
- CPU clock rate of 1.75 GHz

RISC machine:

- CPI of 1.2 (as it is pipelined) and a CPU clock rate of 1 GHz

The CISC machine uses complex instructions, so the CISC version of the benchmark is 40% smaller than the same benchmark on the RISC machine (that is, CISC IC is 40% less than RISC IC)

The benchmark consists of 38% loads, 10% stores, 35% ALU operations, 3% calls, 3% returns, and 11% branches
Solution
Since both clock rates are in GHz, to simplify, we will drop the GHz units

CISC machine:
First, compute the CISC machines CPI given the individual CPI for the machine and the benchmarks breakdown of instructions:
CPI = 4 * (.38 + .10) + 3 * (.35 + .11) + 10 * (.03 + .03) = 3.9
RISC machine:
CPU time RISC = IC RISC * 1.2 / 1 = 1.2 IC RISC
Recall that the CISC machine has 40% fewer instructions, so IC CISC = .6 * IC RISC
CPU time CISC = .6 * IC RISC * 3.9 / 1.75 = 1.34 IC RISC
Since the RISC CPU time is smaller, the RISC machine is faster by 1.34 / 1.2 = 1.12, or 12%
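Sample Problem #4 in code (a sketch; IC RISC is factored out and clock rates are kept in GHz as in the solution, so the exact ratio comes out ≈ 1.11 versus the slides' rounded 1.12):

```python
# CISC CPI from the benchmark mix
cisc_cpi = 4 * (0.38 + 0.10) + 3 * (0.35 + 0.11) + 10 * (0.03 + 0.03)  # 3.9

# CPU times relative to the RISC instruction count (IC RISC = 1)
cisc_time = 0.6 * cisc_cpi / 1.75   # 40% fewer instructions, 1.75 GHz clock
risc_time = 1.0 * 1.2 / 1.0         # CPI 1.2, 1 GHz clock

speedup = cisc_time / risc_time     # > 1 means RISC finishes first
print(speedup)
```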