
Advanced Computer Architecture

We will consider issues in current architecture design and implementation:

- RISC instruction sets
- Pipelining
- Instruction-level parallelism
- Block-level parallelism
- Thread-level parallelism
- Multiprocessors
- Improving cache performance
- Optimizing virtual memory usage

In CSC 362, we focused on
- the roles of the components in the architecture
- the structure of the architecture (how things connect together)

Here, we focus on
- using available technology to improve computer performance
- using quantitative measures to test architectural ideas
- using a RISC instruction set for examples
- discussing a variety of software and hardware techniques to provide optimization
- attempting to force as much parallelism out of the code as possible

Measuring Performance
We might use one of the following terms to measure performance

- MIPS, MegaFLOPS
  - neither of these tells us how the processor performs on the other type of operation
- Clock speed (GHz rating)
  - misleading, as we will explore throughout the semester
- Execution time
  - worthwhile on an unloaded system
- Throughput
  - number of programs / unit time; useful for servers and large systems
- Wall-clock time
- CPU time: user CPU time and system CPU time
  - CPU time = user CPU time + system CPU time
- System performance
  - on an unloaded system

note: CPU performance = 1 / execution time

What does it mean that one computer is faster than another?

"X is n times faster than Y" means
- Exec time Y / Exec time X = n
- Perf X / Perf Y = n

Meaning of Performance

Example:
- if the throughput of X is 1.3 times higher than Y's, then the number of tasks that can be executed on X in the same amount of time is 1.3 times the number on Y

Example:
- X executes program p1 in .32 seconds; Y executes program p1 in .38 seconds
- X is .38 / .32 = 1.19 times faster, i.e., 19% faster

To validly compare two computers' performance, we must compare performance on the same program. Additionally, computers may perform better on different programs
- e.g., C1 runs P1 faster than C2, but C2 runs P2 faster than C1
- we might use weighted averages or geometric means, as well as distributions, to derive a single processor's overall performance (see pages 34-37 if you are interested)
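The geometric mean mentioned above can be sketched in a few lines of Python; the ratio values in the example are made up for illustration, and the function name is mine:

```python
def geometric_mean(ratios):
    """Geometric mean of per-program execution-time ratios.

    Unlike an arithmetic mean of ratios, the geometric mean ranks
    machines the same way no matter which machine is used as the
    reference for normalization.
    """
    product = 1.0
    for r in ratios:
        product *= r
    return product ** (1.0 / len(ratios))

# Hypothetical ratios (time on C2 / time on C1) for four programs:
print(geometric_mean([1.19, 0.85, 1.40, 1.02]))  # a single overall figure
```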

Benchmarks
Four levels of programs can be used to test performance

- Real programs
  - e.g., a C compiler, CAD tools; programs with input, output, and options that the user selects
- Kernels
  - remove key pieces of programs and test just those
- Toy benchmarks
  - 10-100 lines of code, such as quicksort, whose performance is known in advance
- Synthetic benchmarks
  - try to match the average frequency of operations to simulate larger programs

A benchmark suite is a set of programs that test different performance metrics
- example: test array capabilities, floating point operations, loops
- SPEC benchmark suites are commonly cited; SPEC 96 is the most recent benchmark, see figure 1.13 on page 31

Reporting benchmark results must include
- compiler settings and version
- input
- OS
- number/size of disks

Only real programs are used today
- the others have been discredited since computer architects and compiler writers will optimize systems to perform well on these specific benchmarks/kernels

Results must be reproducible

Principles of Computer Design

As computer architecture research has progressed, several key design concepts have been identified
- the goal today is to further exploit each of these because they provide a great deal of performance speedup
- we will examine these and use a quantitative approach to identify the extent of the speedup

Take advantage of parallelism
- using multiple hardware components (ALU functional units, memory modules, register ports, disk drives, etc.), we can attempt to execute instructions and threads in parallel

Principle of locality of reference
- used to design memory systems so that we can attempt to keep in cache the data and instructions that will most likely be referenced soon

Focus on the common case
- as we see next, a small speedup on the common case is often better than a large speedup on an uncommon case

Amdahl's Law
In order to explore architectural improvements, we need a mechanism to gauge the speedup of our improvements. Amdahl's Law allows us to compute the speedup that can be gained by using a particular feature. Given an enhancement E:

Speedup = performance with E / performance without E
or Speedup = execution time without E / execution time with E

This law uses two factors:
- the fraction of the computation time in the original machine that can be converted to take advantage of the enhancement (F)
- the improvement gained by the enhanced execution mode, i.e., how much faster the task would run if the enhanced mode were used for the entire program (S)

Speedup = 1 / ((1 - F) + F / S)

Examples

Example 1:
A web server is to be enhanced
- the new CPU is 10 times faster on computation than the old CPU
- the original CPU spent 40% of its time processing and 60% of its time waiting for I/O

What will the speedup be?
- fraction enhancement used = 40%
- speedup in enhanced mode = 10
- Speedup = 1 / [(1 - .4) + .4/10] = 1.56

Example 2:
A benchmark consists of:
- 20% FP sqrt
- 50% FP operations (including sqrt)
- 50% other operations

Enhancement options are:
- add FP sqrt hardware to speed up sqrt performance by a factor of 10
- enhance all FP operations by a factor of 1.6

Speedup FP sqrt = 1 / [(1 - .2) + .2/10] = 1.22
Speedup all FP = 1 / [(1 - .5) + .5/1.6] = 1.23
The enhancement to support the common case is (slightly) better
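The two examples above can be checked with a short sketch (the function name is mine):

```python
def amdahl_speedup(f, s):
    """Amdahl's Law: overall speedup when a fraction f of the original
    execution time is sped up by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Example 1: 40% of the time is enhanced, 10 times faster
print(round(amdahl_speedup(0.4, 10), 2))   # 1.56

# Example 2: sqrt hardware vs. enhancing all FP operations
print(round(amdahl_speedup(0.2, 10), 2))   # 1.22
print(round(amdahl_speedup(0.5, 1.6), 2))  # 1.23
```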

CPU Performance Formulae

CPU time = CPU clock cycles * clock cycle time
- CPU clock cycles: the number of clock cycles that elapse during the execution of the given program
- clock cycle time: the reciprocal of the clock rate, that is, how much time elapses for one clock cycle, which gives us:

CPU time = CPU clock cycles for prog / clock rate
CPU time = IC * CPI * clock cycle time
- IC: instruction count (number of instructions)
- CPI: clock cycles per instruction
- IC * CPI = CPU clock cycles

CPI = CPU clock cycles / IC
CPU time = (Σ CPIi * ICi) * clock cycle time
Average CPI = Σ (CPIi * ICi) / total instruction count
- in the latter equations, CPIi and ICi are for each type of operation (for instance, the CPI and number of adds, the CPI and number of loads, ...)

Example
Assume:
- frequency of FP operations = 25% (including sqrt); frequency of FP sqrt = 2%
- average CPI of FP operations = 4.0; CPI of FP sqrt = 20
- average CPI of other instructions = 1.33
- CPI = 4 * 25% + 1.33 * 75% = 2.0

Two alternatives:
- reduce the CPI of FP sqrt to 2, or
- reduce the average CPI of all FP ops (including sqrt) to 2.5

CPI new FP sqrt = CPI original - 2% * (20 - 2) = 1.64
CPI new FP = 75% * 1.33 + 25% * 2.5 ≈ 1.62
Speedup new FP = CPI original / CPI new FP = 2.0 / 1.62 = 1.23
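The weighted-CPI comparison above can be reproduced in unrounded arithmetic (variable names are mine; small differences from the figures in the notes are rounding):

```python
def average_cpi(mix):
    """Weighted-average CPI from (frequency, CPI) pairs."""
    return sum(freq * cpi for freq, cpi in mix)

# 25% FP operations (avg CPI 4.0, sqrt included), 75% other (CPI 1.33)
cpi_original = average_cpi([(0.25, 4.0), (0.75, 1.33)])  # ~2.0
# Alternative 1: FP sqrt (2% of instructions) drops from CPI 20 to 2
cpi_new_sqrt = cpi_original - 0.02 * (20 - 2)            # ~1.64
# Alternative 2: all FP operations drop to an average CPI of 2.5
cpi_new_fp = average_cpi([(0.25, 2.5), (0.75, 1.33)])    # ~1.62
print(cpi_original / cpi_new_fp)                         # ~1.23
```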

Computing Speedup: which formula?

We can compute speedup by
- determining the difference in CPU time before and after an enhancement, or
- using Amdahl's Law

Which should we use?
- the formulas are the same; let's demonstrate this with an example:

A benchmark consists of 35% loads, 15% stores, 40% ALU operations, and 10% branches
- the CPI is 5 for loads and stores and 4 for ALU operations and branches
- since this is an integer benchmark, the floating point registers are not used
- consider that we could keep more values in registers by moving them to floating point registers rather than storing and then reloading these values from memory

Let's have the compiler replace some of the loads/stores with register moves
- this enhancement is done by the compiler, so it costs us nothing!
- assuming that the compiler can convert 20% of the loads/stores, how worthwhile is it?

Solution
We change some loads/stores to ALU operations
- so overall CPI goes down; IC remains the same

Solution 1: compute the CPU time difference
- CPU time = IC * CPI * clock cycle time
- CPIold = 50% * 5 + 50% * 4 = 4.5
- CPInew = 40% * 5 + 60% * 4 = 4.4
- since IC and the clock cycle time have not changed, the speedup is just CPIold / CPInew
- Speedup = 4.5 / 4.4 = 1.0227, a 2.27% speedup

Solution 2: Amdahl's Law
- speedup of the enhanced mode is from 5 cycles to 4 cycles, or 5/4 = 1.25
- fraction used = fraction of the execution time where we use conversions instead of loads/stores
  - overall CPI is 4.5; the enhancement is used on 20% of the loads/stores
  - 20% * 50% * 5 = .5 clock cycles out of 4.5, or .5 / 4.5 = 11.1% of the time
- Speedup = 1 / [1 - F + F / S] = 1 / [1 - .111 + .111 / 1.25] = 1 / .9778 = 1.0227, a 2.27% speedup
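A quick sketch confirming that both methods give the same 2.27% figure (variable names are mine):

```python
# Method 1: CPU-time ratio (IC and clock cycle time cancel out)
cpi_old = 0.50 * 5 + 0.50 * 4                 # 4.5
cpi_new = 0.40 * 5 + 0.60 * 4                 # 4.4
speedup_cpu_time = cpi_old / cpi_new

# Method 2: Amdahl's Law
s = 5 / 4                                     # 5-cycle ops become 4-cycle
f = (0.20 * 0.50 * 5) / cpi_old               # fraction of cycles enhanced
speedup_amdahl = 1 / ((1 - f) + f / s)

print(speedup_cpu_time, speedup_amdahl)       # both ~1.0227
```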

Why MIPS Can Be Misleading

MIPS = IC / (execution time * 10^6); since execution time = IC * CPI / clock rate, MIPS = clock rate / (CPI * 10^6)

Assume a load-store machine with a breakdown of
- 43% ALU operations, 21% loads, 12% stores, 24% branches
- CPI = 1 for ALU operations, CPI = 2 for all other operations
- an optimizing compiler is able to discard 50% of the ALU operations

Ignoring system issues, if the machine has a 2 nanosecond clock cycle (500 MHz), what is the MIPS rating for the optimized and unoptimized versions? Does the MIPS value agree with the execution time?
- CPIunoptimized = .43 * 1 + .57 * 2 = 1.57
- MIPSunoptimized = 500 MHz / (1.57 * 10^6) = 318.5
- CPIoptimized = (.43 / 2 * 1 + .57 * 2) / (1 - .43 / 2) = 1.73
- MIPSoptimized = 500 MHz / (1.73 * 10^6) = 289.0

The optimized program will execute faster because it has fewer instructions, but its CPI is larger because a greater portion of the instructions have a higher CPI, and therefore its MIPS rating is lower. So, MIPS and execution time are not directly related!
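The MIPS paradox above can be reproduced numerically; this is a sketch, and the variable names are mine:

```python
def mips(clock_rate_hz, cpi):
    """MIPS rating = clock rate / (CPI * 10**6)."""
    return clock_rate_hz / (cpi * 1e6)

clock = 500e6  # 2 ns clock cycle

cpi_unopt = 0.43 * 1 + 0.57 * 2                        # 1.57
# Half of the ALU operations are discarded; renormalize the mix
cpi_opt = (0.43 / 2 * 1 + 0.57 * 2) / (1 - 0.43 / 2)   # ~1.73

# Relative execution time, per original instruction count
time_unopt = 1.0 * cpi_unopt / clock
time_opt = (1 - 0.43 / 2) * cpi_opt / clock            # fewer instructions

print(mips(clock, cpi_unopt))  # ~318.5
print(mips(clock, cpi_opt))    # a lower MIPS rating,
print(time_opt < time_unopt)   # yet faster execution: True
```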

Sample Problem #1
Consider adding register-memory ALU instructions to a machine that previously only permitted register-register ALU operations. Assume a benchmark with the following breakdown of operations is used to test this enhancement:
- ALU operations: 43%, CPI = 1
- Loads: 21%, CPI = 2
- Stores: 12%, CPI = 2
- Branches: 24%, CPI = 2

The new ALU register-memory operation has the following consequences:
- ALU register-memory operations have CPI = 2, and
- branches now have CPI = 3

But 25% of data loaded are used only once, so the new ALU register-memory instruction can be used in place of the load + ALU pair. Is it worth it?

Solution
CPIold = .43 * 1 + .57 * 2 = 1.57

Three changes:
- some ALU operations use the new mode, which changes their CPI
- fewer loads
- all branches have a higher CPI

We have a new distribution:
- 25% of ALU operations become ALU-memory operations
- 25% * 43% = 11%, so we remove this many loads, giving us 89% as many instructions as previously
- Loads: [21% - (25% * 43%)] / 89% = 11%
- Stores: 12% / 89% = 13%
- ALU operations: 43% / 89% = 48%
- Branches: 24% / 89% = 27%

CPInew = .11 * 2 + .13 * 2 + .27 * 3 + .48 * (.25 * 2 + .75 * 1) = 1.89

CPU time = IC * CPI * clock cycle time
- the clock cycle time remains unchanged, CPI has been recomputed, and IC in the new system is 89% of the old system
- CPUold = IC * 1.57 * CCT
- CPUnew = .89 * IC * 1.89 * CCT
- Speedup = 1.57 / (.89 * 1.89) = .934
- this is a slowdown, so this enhancement is not an improvement!
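Working the same problem with unrounded fractions gives roughly 0.92 rather than the 0.934 obtained by rounding each class to whole percentages; either way it is a slowdown. A sketch (names mine):

```python
# Old machine: 43% ALU (CPI 1); loads 21%, stores 12%, branches 24% (CPI 2)
cycles_old = 0.43 * 1 + (0.21 + 0.12 + 0.24) * 2   # CPI_old = 1.57

# 25% of ALU ops absorb their load (removing 25% * 43% of instructions);
# ALU register-memory ops have CPI 2, and branches rise to CPI 3
removed_loads = 0.25 * 0.43
cycles_new = ((0.21 - removed_loads) * 2           # remaining loads
              + 0.12 * 2                           # stores
              + 0.24 * 3                           # branches, CPI now 3
              + 0.43 * (0.25 * 2 + 0.75 * 1))      # ALU mix, 25% reg-mem

# Both cycle counts are per *old* instruction, so the IC reduction is
# already built in and no renormalization is needed
print(cycles_old / cycles_new)  # < 1: a slowdown
```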

Sample Problem #2
Assume a machine with a perfect cache and the following instruction mix:
- ALU: 43%, CPI 1
- Loads: 21%, CPI 2
- Stores: 12%, CPI 2
- Branches: 24%, CPI 2

CPIperfect cache = .43 * 1 + .57 * 2 = 1.57

Because of cache misses, we have to recompute the CPI of every instruction: the miss rate is 5% during instruction fetch and 10% during data accesses, and each miss adds 40 cycles to the CPI.

CPIimperfect cache = .43 * (1 + .05 * 40) + .21 * (2 + .05 * 40 + .10 * 40) + .12 * (2 + .05 * 40 + .10 * 40) + .24 * (2 + .05 * 40) = 4.89

How much faster is the machine with the perfect cache?
- 4.89 / 1.57 = 3.11 times faster
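The perfect-vs-imperfect cache comparison can be sketched as follows (names mine):

```python
MISS_PENALTY = 40
I_MISS, D_MISS = 0.05, 0.10   # instruction-fetch and data-access miss rates

mix = [            # (frequency, base CPI, data accesses per instruction)
    (0.43, 1, 0),  # ALU
    (0.21, 2, 1),  # loads
    (0.12, 2, 1),  # stores
    (0.24, 2, 0),  # branches
]

cpi_perfect = sum(f * cpi for f, cpi, _ in mix)                 # 1.57
cpi_imperfect = sum(
    f * (cpi + I_MISS * MISS_PENALTY + d * D_MISS * MISS_PENALTY)
    for f, cpi, d in mix
)                                                               # 4.89
print(cpi_imperfect / cpi_perfect)                              # ~3.11
```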

Sample Problem #3
Architects are considering one of two enhancements for their processor
- #1 can be used 20% of the time and offers a speedup of 3
- #2 offers a speedup of 7

What fraction of the time will the second enhancement have to be used in order to achieve the same overall speedup as the first enhancement?

speedup from #1 = 1 / [(1 - .2) + .2 / 3] = 1.154

So, for the second enhancement to match, we have 1.154 = 1 / [(1 - x) + x / 7] and we must solve for x
- (1 - x) + x/7 = (7 - 7x + x) / 7 = (7 - 6x) / 7
- so 1.154 = 7 / (7 - 6x), or 7 - 6x = 7 / 1.154
- 6x = 7 - 7 / 1.154 = 0.934, so x = 0.934 / 6 = 0.156
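The algebra above can be double-checked numerically (a sketch; names mine):

```python
def amdahl(f, s):
    """Amdahl's Law: 1 / ((1 - f) + f / s)."""
    return 1 / ((1 - f) + f / s)

target = amdahl(0.20, 3)        # enhancement #1: ~1.154

# Solve 1/((1 - x) + x/7) = target for x:
#   (1 - x) + x/7 = 1 - 6x/7 = 1/target  =>  x = (1 - 1/target) * 7/6
x = (1 - 1 / target) * 7 / 6
print(x)                                  # ~0.156
print(abs(amdahl(x, 7) - target) < 1e-9)  # True: the speedups match
```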

Sample Problem #4
We will compare a CISC machine and a RISC machine on a benchmark. The machines have the following characteristics:
- the CISC machine has CPIs of 4 for load/store, 3 for ALU/branch, and 10 for call/return, with a CPU clock rate of 1.75 GHz
- the RISC machine has a CPI of 1.2 (as it is pipelined) and a CPU clock rate of 1 GHz
- the CISC machine uses complex instructions, so the CISC version of the benchmark has 40% fewer instructions than the same benchmark on the RISC machine (that is, CISC IC is 40% less than RISC IC)

The benchmark has a breakdown of:
- 38% loads, 10% stores, 35% ALU operations, 3% calls, 3% returns, and 11% branches

Which machine will run the benchmark in less time?

Solution
We compare the CPU time for both machines
- CPU time = IC * CPI / clock rate
- since both clock rates are in GHz, to simplify, we will drop the GHz units

CISC machine:
- first, compute the CISC machine's CPI from the individual CPIs and the benchmark's breakdown of instructions:
- CPI = 4 * (.38 + .10) + 3 * (.35 + .11) + 10 * (.03 + .03) = 3.9
- CPU time CISC = IC CISC * 3.9 / 1.75

RISC machine:
- CPU time RISC = IC RISC * 1.2 / 1

Recall that the CISC machine has 40% fewer instructions, so IC CISC = .6 * IC RISC
- CPU time CISC = .6 * IC RISC * 3.9 / 1.75 = 1.34 IC RISC
- CPU time RISC = 1.2 IC RISC
- since the RISC CPU time is smaller, the RISC machine is faster, by 1.34 / 1.2 = 1.12 or 12%
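The CISC/RISC comparison in exact arithmetic (the notes round 1.337 to 1.34, so the exact ratio is closer to 1.11; names mine):

```python
# Benchmark mix, as fractions of the RISC instruction count
loads, stores, alu, calls, rets, branches = 0.38, 0.10, 0.35, 0.03, 0.03, 0.11

# CISC: per-class CPIs, 1.75 GHz clock, 40% fewer instructions than RISC
cpi_cisc = 4 * (loads + stores) + 3 * (alu + branches) + 10 * (calls + rets)
cpu_cisc = 0.6 * cpi_cisc / 1.75   # time, in units of (RISC IC / 10**9)

# RISC: CPI 1.2, 1 GHz clock
cpu_risc = 1.2 / 1.0

print(cpi_cisc)              # ~3.9
print(cpu_risc < cpu_cisc)   # True: the RISC machine wins
print(cpu_cisc / cpu_risc)   # ~1.11, so RISC is roughly 11-12% faster
```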
