Lecture 3

CS F342: Computer Architecture
(First Semester 2022-23)
Lect 3: Performance, MIPS Instructions

Dr. Nikumani Choudhury
Asst. Prof., Dept. of Computer Sc. & Information Systems
nikumani@hyderabad.bits-pilani.ac.in
BITS Pilani
Hyderabad Campus
CPI in More Detail
If different instruction classes take different numbers of cycles

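$$\text{Clock Cycles} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \text{Instruction Count}_i\right)$$

Weighted average CPI

$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}}\right)$$

– The ratio Instruction Count_i / Instruction Count is the relative frequency of class i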
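As a quick numeric check of the weighted-average CPI formula, the sketch below computes total clock cycles and average CPI for a made-up instruction mix. The class names, CPI values, and instruction counts are illustrative assumptions, not figures from the lecture.

```python
# Hypothetical instruction mix: class name -> (CPI_i, Instruction Count_i).
# These numbers are assumptions chosen only to illustrate the formula.
mix = {
    "alu": (1, 50),
    "load": (2, 30),
    "branch": (3, 20),
}


def clock_cycles(mix):
    """Sum of CPI_i x Instruction Count_i over all instruction classes."""
    return sum(cpi * count for cpi, count in mix.values())


def average_cpi(mix):
    """Weighted average CPI: total clock cycles / total instruction count."""
    total_instructions = sum(count for _, count in mix.values())
    return clock_cycles(mix) / total_instructions


print(clock_cycles(mix))  # 1*50 + 2*30 + 3*20 = 170
print(average_cpi(mix))   # 170 / 100 = 1.7
```

Classes with a high CPI only hurt the average in proportion to how often they occur, which is what the relative-frequency term expresses.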
2 CS F342 BITS Pilani, Hyderabad Campus

Performance Summary
The BIG Picture
$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

Performance depends on
– Algorithm: affects IC, possibly CPI
– Programming language: affects IC, CPI
– Compiler: affects IC, CPI
– Instruction set architecture: affects IC, CPI, and clock rate (1/Tc, where Tc is the clock cycle time)
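The CPU time equation can be evaluated directly. The following is a minimal sketch; the function name `cpu_time` and the sample instruction count, CPI, and clock rate are assumptions made for illustration only.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds.

    Instructions/program x clock cycles/instruction gives total clock cycles;
    dividing by the clock rate (cycles per second) is the same as multiplying
    by the clock cycle time (seconds per cycle).
    """
    total_cycles = instruction_count * cpi
    return total_cycles / clock_rate_hz


# Hypothetical program: 10^9 instructions, average CPI of 2.0, 2 GHz clock.
seconds = cpu_time(1_000_000_000, 2.0, 2_000_000_000)
print(seconds)  # 1.0
```

Halving any one of the three factors halves CPU time, but the bullets above show that most design choices move more than one factor at once.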

Uniprocessor Performance
Constrained by power, instruction-level parallelism, and memory latency
Multicore microprocessors
Multicore microprocessors
– More than one processor per chip
Requires explicitly parallel programming

– Compare with instruction level parallelism
• Hardware executes multiple instructions at once
• Hidden from the programmer
– Hard to do
• Programming for performance
• Load balancing
• Optimizing communication and synchronization

SPEC CPU Benchmark
Programs used to measure performance
– Supposedly typical of actual workload
Standard Performance Evaluation Corp (SPEC)
– Develops benchmarks for CPU, I/O, Web, …
SPEC CPU2017
– Elapsed time to execute a selection of programs
• Negligible I/O, so focuses on CPU performance
– Set of 10 integer benchmarks and 13 floating point benchmarks
– Summarize as geometric mean of performance ratios
$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$
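The geometric mean of performance ratios can be sketched in a few lines. The ratio values below are invented for illustration and are not SPEC results.

```python
import math


def geometric_mean(ratios):
    """n-th root of the product of n execution time ratios."""
    if not ratios:
        raise ValueError("need at least one ratio")
    return math.prod(ratios) ** (1.0 / len(ratios))


# Hypothetical execution time ratios (reference time / measured time).
ratios = [2.0, 8.0]
print(geometric_mean(ratios))  # close to 4.0, the square root of 16
```

Unlike the arithmetic mean, the geometric mean gives the same relative ranking of two machines regardless of which machine is used as the reference.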

SPEC Power Benchmark
Power consumption of server at different workload levels

– Performance: ssj_ops/sec
– Power: Watts (Joules/sec)
$$\text{Overall ssj\_ops per Watt} = \left(\sum_{i=0}^{10} \text{ssj\_ops}_i\right) \Bigg/ \left(\sum_{i=0}^{10} \text{power}_i\right)$$
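The overall metric is a ratio of sums across workload levels, not an average of per-level ratios. A minimal sketch follows, assuming 11 load levels (100% down to 0% in 10% steps); all measurements are made-up numbers, not benchmark data.

```python
def overall_ssj_ops_per_watt(ssj_ops, power_watts):
    """Sum of ssj_ops over all load levels divided by sum of power."""
    if len(ssj_ops) != len(power_watts):
        raise ValueError("need one power reading per load level")
    return sum(ssj_ops) / sum(power_watts)


# Hypothetical readings at 100%, 90%, ..., 0% target load.
ssj_ops = [1000 * level for level in range(10, -1, -1)]      # 10000 ... 0
power_watts = [100 + 10 * level for level in range(10, -1, -1)]  # 200 ... 100

print(overall_ssj_ops_per_watt(ssj_ops, power_watts))  # 55000 / 1650
```

Note that the idle level contributes zero operations but still contributes power, so a server that draws a lot of power when idle scores poorly.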

Pitfall: Amdahl’s Law
Improving an aspect of a computer and expecting a proportional
improvement in overall performance
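$$T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$$

Example: multiply accounts for 80 s of a 100 s run
– How much improvement in multiply performance is needed to get 5× overall?
– A 5× overall speedup means a total time of 100/5 = 20 s, so with improvement factor n:

$$20 = \frac{80}{n} + 20$$

– Can’t be done! The unaffected 20 s already uses the whole budget, so 80/n would have to be 0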
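The pitfall can be checked numerically. The sketch below, which assumes the 80 s / 20 s split from the multiply example, shows the improved time approaching but never reaching 20 s as the improvement factor grows; the function name `improved_time` is an illustrative choice.

```python
def improved_time(t_affected, t_unaffected, improvement_factor):
    """T_improved = T_affected / improvement factor + T_unaffected."""
    return t_affected / improvement_factor + t_unaffected


# Multiply takes 80 s of a 100 s run; the other 20 s is unaffected.
for n in (2, 10, 100, 1_000_000):
    t = improved_time(80.0, 20.0, n)
    # Overall speedup is the old time (100 s) divided by the new time.
    print(n, t, 100.0 / t)
```

Every printed overall speedup stays below 5, because the unaffected 20 s puts a floor under the execution time.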

Amdahl’s law
In computer architecture, Amdahl’s law (or Amdahl’s Argument) is a formula which gives the theoretical speedup in
latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It is
named after the computer scientist Gene Amdahl.
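The speedup formula is given by:

$$\text{speedup} = \frac{\text{Execution time without enhancement}}{\text{Execution time with enhancement}} = \frac{\text{Execution Time}_{\text{old}}}{\text{Execution Time}_{\text{new}}}$$

$$\text{Execution Time}_{\text{new}} = \text{Execution Time}_{\text{old}} \times \left( (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right)$$

$$\text{overall speedup} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$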

Example
a) Given Speedup_enhanced = 20
Fraction_enhanced = 0.5
overall speedup = ?
overall speedup = 1/((1 - 0.5) + 0.5/20) = 1/0.525 = 1.905
b) Given Speedup_enhanced = 16
Fraction_enhanced = 0.6
overall speedup = ?
overall speedup = 1/((1 - 0.6) + 0.6/16) = 1/0.4375 = 2.286
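To double-check the arithmetic, here is a minimal sketch that evaluates the overall speedup formula for both cases. The function name `overall_speedup` is an illustrative choice.

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's law: 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


print(round(overall_speedup(0.5, 20), 3))  # case a: 1 / 0.525  -> 1.905
print(round(overall_speedup(0.6, 16), 3))  # case b: 1 / 0.4375 -> 2.286
```

Even with a 20× enhancement, case a stays below 2× overall because half of the execution time is untouched.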

13/09/22
Instruction Set
The repertoire of instructions of a computer

Different computers have different instruction sets
– But with many aspects in common
Early computers had very simple instruction sets
– Simplified implementation
Many modern computers also have simple instruction sets

The MIPS Instruction Set
Stanford MIPS commercialized by MIPS Technologies (www.mips.com)
Typical of many modern ISAs
– See MIPS Reference Data tear-out card, and Appendixes B and E
Similar ISAs have a large share of embedded core market
– Applications in consumer electronics, network/storage equipment,
cameras, printers, …

Arithmetic Operations
Add and subtract, three operands

– Two sources and one destination
add a, b, c # a gets b + c
All arithmetic operations have this form
Design Principle 1: Simplicity favors regularity
– Regularity makes implementation simpler
– Simplicity enables higher performance at lower cost

Arithmetic Example
C code:
f = (g + h) - (i + j);
Compiled MIPS code:
add t0, g, h   # temp t0 = g + h
add t1, i, j   # temp t1 = i + j
sub f, t0, t1  # f = t0 - t1

Register Operands
Arithmetic instructions use register operands
MIPS has a 32 × 32-bit register file
– Use for frequently accessed data
– Numbered 0 to 31
– 32-bit data called a “word”
Assembler names
– $t0, $t1, …, $t9 for temporary values
– $s0, $s1, …, $s7 for saved variables
Design Principle 2: Smaller is faster
– cf. main memory: millions of locations

Register Operand Example
C code:
f = (g + h) - (i + j);
– f, …, j in $s0, …, $s4
Compiled MIPS code:
add $t0, $s1, $s2  # $t0 = g + h
add $t1, $s3, $s4  # $t1 = i + j
sub $s0, $t0, $t1  # f = $t0 - $t1
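To trace the three instructions by hand, the register file can be modelled as a dictionary. This is only a sketch: the initial values of g, h, i, and j are made up, the helper names `add` and `sub` are illustrative, and the model simply wraps results to 32 bits (real MIPS `add`/`sub` raise an exception on signed overflow, which is ignored here).

```python
MASK_32 = 0xFFFFFFFF  # registers hold 32-bit words

# Hypothetical contents: f in $s0, g in $s1, h in $s2, i in $s3, j in $s4.
regs = {"$s0": 0, "$s1": 10, "$s2": 5, "$s3": 3, "$s4": 4, "$t0": 0, "$t1": 0}


def add(rd, rs, rt):
    """rd = rs + rt, wrapped to 32 bits (overflow trap not modelled)."""
    regs[rd] = (regs[rs] + regs[rt]) & MASK_32


def sub(rd, rs, rt):
    """rd = rs - rt, wrapped to 32 bits (overflow trap not modelled)."""
    regs[rd] = (regs[rs] - regs[rt]) & MASK_32


add("$t0", "$s1", "$s2")  # $t0 = g + h = 15
add("$t1", "$s3", "$s4")  # $t1 = i + j = 7
sub("$s0", "$t0", "$t1")  # f = 15 - 7 = 8
print(regs["$s0"])        # 8
```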

Memory Operands
Main memory used for composite data

– Arrays, structures, dynamic data
To apply arithmetic operations
– Load values from memory into registers
– Store result from register to memory
Memory is byte addressed
– Each address identifies an 8-bit byte
Words are aligned in memory
– Address must be a multiple of 4
MIPS is Big Endian
– Most-significant byte at the lowest address of a word
– cf. Little Endian: least-significant byte at the lowest address
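Byte order and word alignment can be illustrated with Python's standard `struct` module. The sample word value and the helper `is_word_aligned` are assumptions made for this sketch.

```python
import struct

word = 0x12345678  # arbitrary 32-bit example value

# ">I" packs an unsigned 32-bit integer big-endian, "<I" little-endian.
big = struct.pack(">I", word)
little = struct.pack("<I", word)

print(big.hex())     # 12345678: most-significant byte 0x12 at lowest address
print(little.hex())  # 78563412: least-significant byte 0x78 at lowest address


def is_word_aligned(address):
    """A word address must be a multiple of 4 bytes."""
    return address % 4 == 0


print(is_word_aligned(32))  # True
print(is_word_aligned(34))  # False
```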

Memory Operand Example 1
C code:
g = h + A[8];
– g in $s1, h in $s2, base address of A in $s3
Compiled MIPS code:
– Index 8 requires offset of 32
• 4 bytes per word
lw $t0, 32($s3)    # load word: $t0 = A[8]
add $s1, $s2, $t0  # g = h + A[8]

Memory Operand Example 2
C code:
A[12] = h + A[8];
– h in $s2, base address of A in $s3
Compiled MIPS code:
– Index 8 requires offset of 32; index 12 requires offset of 48
lw $t0, 32($s3)    # load word: $t0 = A[8]
add $t0, $s2, $t0  # $t0 = h + A[8]
sw $t0, 48($s3)    # store word: A[12] = $t0
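The offsets in both memory operand examples come from multiplying the array index by the 4-byte word size. The sketch below models memory as a dictionary from byte address to word and replays the load/add/store sequence; the base address, array contents, value of h, and helper names are all hypothetical.

```python
WORD_SIZE = 4  # bytes per word


def offset_of(index):
    """Byte offset of a word-array element from the base address."""
    return index * WORD_SIZE


base = 0x1000  # hypothetical base address of A, held in $s3
# Word-addressed model of memory: A[k] = 10 * k for k = 0..15.
memory = {base + offset_of(k): 10 * k for k in range(16)}
regs = {"$s2": 7, "$s3": base, "$t0": 0}  # h = 7 (made up)


def lw(rt, offset, rs):
    """Load word: rt = memory[rs + offset]; address must be word aligned."""
    address = regs[rs] + offset
    assert address % WORD_SIZE == 0, "unaligned word access"
    regs[rt] = memory[address]


def sw(rt, offset, rs):
    """Store word: memory[rs + offset] = rt; address must be word aligned."""
    address = regs[rs] + offset
    assert address % WORD_SIZE == 0, "unaligned word access"
    memory[address] = regs[rt]


lw("$t0", offset_of(8), "$s3")           # lw $t0, 32($s3): $t0 = A[8] = 80
regs["$t0"] = regs["$s2"] + regs["$t0"]  # add $t0, $s2, $t0: 7 + 80 = 87
sw("$t0", offset_of(12), "$s3")          # sw $t0, 48($s3): A[12] = 87
print(memory[base + 48])                 # 87
```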
