You are on page 1of 6

Performance in General

CSE 675.02: Introduction to Computer Architecture • "X is n times faster than Y" means:
Performance (X)
n = ––––––––––––––
Performance Performance (Y)

of Computer Systems • Speed of Concorde vs. Boeing 747

• Throughput of Boeing 747 vs. Concorde


Presentation C
• Cost is also an important parameter in the equation, which is
why Concorde was put to pasture!
• The Bottom Line:
slides by Gojko Babi
Performance and Cost or Cost and Performance?

Studying Assignment: Chapter 4


g. babic Presentation C 2

Basic Performance Metrics Computer Performance: Introduction


• Response time: the time between the start and the completion • The computer user is interested in response time (or execution
of a task (in time units) time) – the time between the start and completion of a given
task (program).
• Throughput: the total amount of tasks done in a given time
period (in number of tasks per unit of time) • The manager of a data processing center is interested in
throughput – the total amount of work done in given time.
• Example: Car assembly factory:
• The computer user wants response time to decrease, while
– 4 hours to produce a car (response time), the manager wants throughput increased.
– 6 cars per an hour produced (throughput)
• Main factors influencing performance of computer system are:
In general, there is no relationship between those two metrics, – processor and memory,
– throughput of the car assembly factory may increase to 18 – input/output controllers and peripherals,
cars per an hour without changing time to produce one car. – compilers, and
– How? – operating system.
g. babic Presentation C 3 g. babic Presentation C 4

CPU Time or CPU Execution Time Analysis of CPU Time


CPU time depends on the program which is executed,
CPU time (or CPU Execution time) is the time between the start
including:
and the end of execution of a given program. This time
– the number of instructions executed,
accounts for the time CPU is computing the given program,
including operating system routines executed on the program’s – types of instructions executed and their frequency of usage.
behalf, and it does not include the time waiting for I/O and Computers are constructed is such way that events in hardware
running other programs. are synchronized using a clock.
Clock rate is given in Hz (=1/sec).
• CPU time is a true measure of processor/memory performance. A clock rate defines durations of discrete time intervals called
clock cycle times or clock cycle periods:
• Performance of processor/memory = 1 / CPU_time – clock_cycle_time = 1/clock_rate (in sec)
Thus, when we refer to different instruction types (from perfor-
mance point of view), we are referring to instructions with
different number of clock cycles required (needed) to execute.
g. babic Presentation C 5 g. babic Presentation C 6

1
CPU Time Equation Calculating Components of CPU time
• For an existing processor it is easy to obtain the CPU time (i.e.
• CPU time = Clock cycles for a program * Clock cycle time the execution time) by measurement, and the clock rate is
= Clock cycles for a program / Clock rate known. But, it is difficult to figure out the instruction count or
Clock cycles for a program is a total number of clock cycles CPI.
needed to execute all instructions of a given program.
Newer processors, MIPS processor is such an example,
• CPU time = Instruction count * CPI / Clock rate include counters for instructions executed and for clock cycles.
Those can be helpful to programmers trying to understand and
Instruction count is a number of instructions executed, tune the performance of an application.
sometimes referred as the instruction path length.
CPI – the average number of clock cycles per instruction (for a
• Also, different simulation techniques and queuing theory could
given execution of a given program) is an important parameter
be used to obtain values for components of the execution
given as:
(CPU) time.
CPI = Clock cycles for a program / Instructions count
g. babic Presentation C 7 g. babic Presentation C 8

Analysis of CPU Performance Equation Calculating CPI


The table below indicates frequency of all instruction types execu-
• CPU time = Instruction count * CPI / Clock rate ted in a “typical” program and, from the reference manual, we are
provided with a number of cycles per instruction for each type.
• How to improve (i.e. decrease) CPU time: Instruction Type Frequency Cycles
– Clock rate: hardware technology & organization,
ALU instruction 50% 4
– CPI: organization, ISA and compiler technology, Load instruction 30% 5
– Instruction count: ISA & compiler technology. Store instruction 5% 4
Branch instruction 15% 2
Many potential performance improvement techniques primarily
improve one component with small or predictable impact on the CPI = 0.5*4 + 0.3*5 + 0.05*4 + 0.15*2 = 4 cycles/instruction
other two.

g. babic Presentation C 9 g. babic Presentation C 10

CPU Time: Example 1 CPU Time: Example 1 (continued)


• a. Approach 1:
Consider an implementation of MIPS ISA with 500 MHz clock and
– each ALU instruction takes 3 clock cycles, Clock cycles for a program = (x3 + y2 + z4 + w5)
– each branch/jump instruction takes 2 clock cycles, = 910  106 clock cycles
– each sw instruction takes 4 clock cycles, CPU_time = Clock cycles for a program / Clock rate
– each lw instruction takes 5 clock cycles. = 910  106 / 500  106 = 1.82 sec

Also, consider a program that during its execution executes: • b. Approach 2:


– x=200 million ALU instructions CPI = Clock cycles for a program / Instructions count
– y=55 million branch/jump instructions CPI = (x3 + y2 + z4 + w5)/ (x + y + z + w)
– z=25 million sw instructions = 3.03 clock cycles/ instruction
– w=20 million lw instructions
CPU time = Instruction count  CPI / Clock rate
Find CPU time. Assume sequentially executing CPU. = (x+y+z+w)  3.03 / 500  106
= 300  106  3.03 /500  106
= 1.82 sec
g. babic Presentation C 11 g. babic Presentation C 12

2
CPU Time: Example 2 Calculating CPI
Consider another implementation of MIPS ISA with 1 GHz clock
and
CPI = (x3 + y2 + z4 + w5)/ (x + y + z + w)
– each ALU instruction takes 4 clock cycles,
– each branch/jump instruction takes 3 clock cycles,
– each sw instruction takes 5 clock cycles,
– each lw instruction takes 6 clock cycles. The calculation may not be necessary correct (and usually it isn’t)
Also, consider the same program as in Example 1. since the numbers of cycles per instruction given don’t account
Find CPI and CPU time. Assume sequentially executing CPU. for pipeline effects and other advanced design techniques.
---------------------------------------------------------------------------------
CPI = (x4 + y3 + z5 + w6)/ (x + y + z + w)
= 4.03 clock cycles/ instruction
CPU time = Instruction count  CPI / Clock rate
= (x+y+z+w)  4.03 / 1000  106
= 300  106  4.03 /1000  106
= 1.21 sec
g. babic Presentation C 13 g. babic Presentation C 14

Phases in MIPS Instruction Execution Sequential Execution of 3 LW Instructions


• Assumed are the following delays: Memory access = 2 nsec,
• We can divide the execution of an instruction into the ALU operation = 2 nsec, Register file access = 1 nsec;
following five stages:
P ro g ra m
e x e c u t io n 2 4 6 8 10 12 14 16 18
– IF: Instruction fetch o rd er
T im e

( i n in s tr u c t i o n s )
– ID: Instruction decode and register fetch lw r 1 , 1 0 0 ( r 4 )
Instruction
Reg ALU
Data
R eg
fetch access
– EX: Execution, effective address or branch calculation lw r 2 , 2 0 0 ( r 5 )
Instruction Data
8 ns Reg ALU Reg
fetch access
– MEM: Memory access (for lw and sw instructions only) Instruction
lw r 3 , 3 0 0 ( r 6 ) 8 ns
fetch
...
– WB: Register write back (for ALU and lw instructions) 8 ns
Every lw instruction needs 8 nsec to execute.
In this course, we shall design a processor that executes
instructions sequentially, i.e. as illustrated here.
g. babic Presentation C 15 g. babic Presentation C 16

Pipelining: Its Natural! Sequential Laundry


• Modern processors use advanced hardware design 6 PM 7 8 9 10 11 Midnight
techniques, such as pipelining and out of order execution. Time

• Dave has four loads of clothes 30 40 20 30 40 20 30 40 20 30 40 20


A B C D
to wash, dry, and fold T
a A
s
• Washer takes 30 minutes k
B
O
r
d C
• Dryer takes 40 minutes e
r
D
Sequential laundry takes 6 hours for 4 loads;
• “Folder” takes 20 minutes
If Dave learned pipelining, how long would laundry take?
g. babic Presentation C 17 g. babic Presentation C 18

3
Pipelined Laundry Pipeline Executing 3 LW Instructions
• Assuming delays as in the sequential case and pipelined
6 PM 7 8 9 10 11 Midnight processor with a clock cycle time of 2 nsec.
P ro g ra m
Time e x e c u t io n 2 4 6 8 10 12 14
T im e
o rd er

30 40 40 40 40 20 ( in in s t r u c t i o n s )
lw r 1 , 1 00 (r 4 )
Instruction Data
Reg ALU Reg
fetch access

T A lw r 2 , 2 00 (r 5 ) 2 ns
Instruction
Reg ALU
Data
Reg
fetch access
a
s lw r 3 , 3 00 (r 6 ) 2 ns
Instruction
Reg ALU
Data
Reg
k B fetch access

2 ns 2 ns 2 ns 2 ns 2 ns
O
r
C Note that registers are written during the first part of a cycle and
d read during the second part of the same cycle.
e D
r • Pipelining doesn’t help to execute a single instruction, it may
Pipelined laundry takes 3.5 hours for 4 loads; improve performance by increasing instruction throughput;
g. babic Presentation C 19 g. babic Presentation C 20

Quantitative Performance Measures Quantitative Performance Measures (continued)


• The original performance measure was time to perform an • Another popular, misleading and essentially useless measure
individual instruction, e.g. add. was peak MIPS. That is a MIPS obtained using an instruction
• Next performance measure was the average instruction time, mix that minimizes the CPI, even if that instruction mix is totally
obtained from the relative frequency of instructions in some impractical. Computer manufacturers still occasionally announ-
typical instruction mix and times to execute each instruction.
Since instruction sets were similar, this was a more accurate ce products using peak MIPS as a metric, often neglecting to
comparison. include the work “peak”.
• One alternative to execution time as the metric was MIPS –
Million Instructions Per Second. For a given program MIPS • Another popular alternative to execution time was Million
rating is simple: Floating Point Operations Per Second – MFLOPS:
Instruction count Clock rate Number of floating point operations in a program
MIPS rating = –––––––––––––– = ––––––––– MFLOPS = ––––––––––––––––––––––––––––––––––––––––
CPU time * 106 CPI * 106 Execution time * 106
The problems with MIPS rating as a performance measure: Because it is based on operations in the program rather than
– difficult to compare computers with different instruction sets, on instructions, MFLOPS has a stronger claim than MIPS to
– MIPS varies between programs on the same computer, being a fair comparison between different machines. MFLOPS
– MIPS can vary inversely with performance! are not applicable outside floating-point performance.
g. babic Presentation C 21 g. babic Presentation C 22

Benchmark Suites SPEC Benchmark Suites


Collections of “representative” programs
used to measure the performance of processors. • The SPEC benchmarks are real programs, modified for
Benchmarks could be: portability and to minimize the role of I/O in overall benchmark
– real programs; performance. Example: Optimizer GNU C compiler.
– modified (or scripted) applications; • First in 1989, SPEC89 was introduced with 4 integer programs
– kernels – small, key pieces from real programs; and 6 floating point programs, providing a single “SPECmarks”.
– synthetic benchmarks – not real programs, but codes try to
match the average frequency of operations and operands of • SPEC92 had 5 integer programs and 14 floating point
a large set of programs. programs, and provided SPECint92 and SPECfp92.
Examples: Whetstone and Dhrystone benchmarks;
• SPEC95 provided SPECint_base95, SPECfp_base95.
• SPEC (Standard Performance Evaluation Corporation) was
founded in late 1980s to try to improve the state of bench- • SPEC CPU2000 has 12 integer benchmarks and 14 floating
marking and make more valid base for comparison of desk point benchmarks, and provides CINT2000 and CFP2000.
top and server computers.
g. babic Presentation C 23 g. babic Presentation C 24

4
Summarizing Performance Summarizing SPEC CPU2000 Performance
• The arithmetic mean of the execution times is given as: SPEC CPU2000 summarizes performance using a geometric
n mean of ratios, with larger numbers indicating higher

1
–* Timei performance.
n i=1
CINT2000 is indicator of integer performance and it is given as:
where Timei is the execution time for the ith program of a 12
total of n in the workload (benchmark). 12

• The weighted arithmetic mean of execution times is given as:


CINT2000 = k1 
i=1
CPU time basei/CPU timei
n

 Weight
i=1
i * Timei where k1 is a coefficient and CPU timei is the CPU time for the
ith integer program of a total of 12 programs in the workload.
where Weighti is the frequency of the ith program in the
workload. Similarly for floating point performance, CFP2000 is given as:
14
• The geometric mean of execution times is given as: 14
n
n
n CFP2000 = k2   FPExec time basei/FPExec timei
 Timei where 
i=1
xi = x1 * x2 * x3* … * xn i=1
i=1
g. babic Presentation C 25 g. babic Presentation C 26

Performance Example (part 1/5) Performance Example (part 2/5)


Note: This example is equivalent to Exercises 4.35, 4.36 and • Machine with No Floating Point Hardware - MNFP does
4.37 in the textbook. not support real number instructions, but all its
instructions are identical to non-real number instructions
• We are interested in two implementations of two similar of MFP. Each MNFP instruction (including integer
but still different ISA, one with and one without special real instructions) takes 2 clock cycles. Thus, MNFP is
number instructions. identical to MFP without real number instructions.
• Both machines have 1000MHz clock. • Any real number operation (in a program) has to be
• Machine With Floating Point Hardware - MFP implements emulated by an appropriate software subroutine (i.e.
real number operations directly with the following compiler has to insert an appropriate sequence of
characteristics: integer instructions for each real number operation). The
– real number multiply instruction requires 6 clock cycles number of integer instructions needed to implement each
– real number add instruction requires 4 clock cycles real number operations is as follows:
– real number divide instruction requires 20 clock cycles – real number multiply needs 30 integer instructions
Any other instruction (including integer instructions) – real number add needs 20 integer instructions
requires 2 clock cycles – real number divide needs 50 integer instructions
g. babic Presentation C 27 g. babic Presentation C 28

Performance Example (part 3/5) Performance Example (part 4/5)


Consider Program P with the following mix of operations:
– real number multiply 10% b. If Program P on MFP needs 300,000,000 instructions, find
– real number add 15% time to execute this program on each machine.
– real number divide 5%
MFP Number of MNFP Number of
– other instructions 70% instructions instructions
a. Find MIPS rating for both machines. real mul 30106 900106
CPIMFP = 0.16 + 0.154 + 0.0520 + 0.72 real add 45106 900106
= 3.6 clocks/instr real div 15106 750106
CPIMNFP = 2 others 210106 210106
clock rate 1000*106 Totals 300106 2760106
MIPSMFP rating = -------------- = ----------- = 277.777
CPI * 106 3.6*106 CPU_timeMFP = 300106  3.6 / 1000  106 = 1.08 sec
MIPSMNFP rating = 500
CPU_timeMNFP = 2760106  2 / 1000  106 = 5.52 sec
According to MIPS rating, MNFP is better than MFP !?
g. babic Presentation C 29 g. babic Presentation C 30

5
Performance Example (part 5/5)
c. Calculate MFLOPS for both computers.
Number of floating point operations in a program
MFLOPS = ––––––––––––––––––––––––––––––––––––––––
Execution time * 106

MFLOPSMFP = 90106 / 1.08106 = 83.3

MFLOPSMNFP = 90106 / 5.52  106 = 16.3

g. babic Presentation C 31

You might also like