
# Computer Architecture


Performance
What do we mean by the performance of a computer?
Two important metrics:

• Response Time or Latency – Time taken to complete a single job. Smaller is better.
• Throughput – Number of jobs done per unit of time. Larger is better.

Does one imply the other?
• Yes. E.g. if latency decreases, throughput will increase.
• No. E.g. in pipelining, latency may have to be increased to increase throughput!
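The pipelining case can be made concrete with a small sketch. The stage delays and latch overhead below are assumed values chosen for illustration, not figures from the slides; the point is only that the pipelined clock must fit the slowest stage, so the latency of one job goes up while the steady-state throughput also goes up.

```python
# Hypothetical stage delays in nanoseconds (assumed, not from the slides).
stage_delays_ns = [2.0, 1.0, 3.0, 2.0, 1.0]
latch_overhead_ns = 0.5  # assumed per-stage pipeline register overhead

# Unpipelined: a job passes through every stage before the next job starts.
unpipelined_latency_ns = sum(stage_delays_ns)
unpipelined_throughput = 1.0 / unpipelined_latency_ns  # jobs per ns

# Pipelined: every stage takes one clock, and the clock must fit the
# slowest stage plus the latch overhead.
cycle_ns = max(stage_delays_ns) + latch_overhead_ns
pipelined_latency_ns = cycle_ns * len(stage_delays_ns)
pipelined_throughput = 1.0 / cycle_ns  # one job completes per cycle (steady state)

print(f"latency:    {unpipelined_latency_ns} ns -> {pipelined_latency_ns} ns")
print(f"throughput: {unpipelined_throughput:.3f} -> {pipelined_throughput:.3f} jobs/ns")
```

Under these assumptions a single job takes longer in the pipelined design, yet more jobs finish per unit of time.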

CPU Performance Equation
$$\text{CPU Time} = \text{Clock Cycles Needed} \times \text{Clock Cycle Time}$$

$$\text{CPU Time} = \frac{\text{No. of Instructions} \times \text{Clocks Per Instruction}}{\text{Clock Rate}}$$

Is this Response Time or Throughput?
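As a minimal sketch, the CPU performance equation translates directly into a function. The name `cpu_time` and the sample numbers are illustrative assumptions, not part of the slides.

```python
def cpu_time(instruction_count, clocks_per_instruction, clock_rate_hz):
    """CPU time in seconds = (instructions * CPI) / clock rate."""
    clock_cycles_needed = instruction_count * clocks_per_instruction
    return clock_cycles_needed / clock_rate_hz


# Hypothetical program: 1 billion instructions, average CPI of 2, 2 GHz clock.
example_seconds = cpu_time(1e9, 2.0, 2e9)
print(f"CPU time: {example_seconds} s")
```

Reducing the instruction count or the CPI, or raising the clock rate, each lowers the result, which is why the three factors are treated separately.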

How can we Improve Performance?
• No. of Instructions can be reduced by:
  – Better ISA
  – Better Compiler
  – Better Algorithm
• Clocks Per Instruction can be reduced by:
  – Better Hardware Design
  – Make the common case faster
• Clock Rate can be increased by:
  – Hardware Design

Numerical Assignment
A computer (3.06 GHz) has the following CPI:

| Instruction Type | CPI |
|---|---|
| A | 1 |
| B | 2 |
| C | 3 |

An algorithm may be implemented in 2 ways, I1 and I2. For each implementation, the number of instructions used (in millions) is as follows:

| Instruction Type | I1 | I2 |
|---|---|---|
| A | 0 | 2 |
| B | 2 | 2 |
| C | 2 | 1 |

1. Which implementation has the smaller number of instructions?
2. What is the average CPI for both implementations? Which implementation is faster?
3. What is the total time taken for executing I1 and I2?
4. What can you say about the MIPS rating?

1. No. of Instructions: I1 = 4 M, I2 = 5 M. Hence I1 has the smaller number of instructions.
2. Clocks required by I1 = 2*2 + 2*3 = 10 M; by I2 = 2*1 + 2*2 + 1*3 = 9 M.
   Average CPI for I1 = 10/4 = 2.5; for I2 = 9/5 = 1.8.
   I2 is faster as it requires fewer clock cycles.
3. Total Time for I1 = 10 M / 3.06 GHz = 3.27 ms; for I2 = 9 M / 3.06 GHz = 2.94 ms.
   Notice that I1 requires fewer instructions, yet it takes longer.
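The arithmetic in this solution can be checked with a short script. The data comes from the assignment; the function and variable names are illustrative choices.

```python
CLOCK_RATE_HZ = 3.06e9  # 3.06 GHz, from the assignment

# CPI per instruction type and instruction counts (in millions), from the assignment.
CPI = {"A": 1, "B": 2, "C": 3}
COUNTS_MILLIONS = {
    "I1": {"A": 0, "B": 2, "C": 2},
    "I2": {"A": 2, "B": 2, "C": 1},
}


def summarize(counts_millions):
    """Instruction count, clock cycles, average CPI and time for one implementation."""
    instructions = sum(counts_millions.values()) * 1e6
    cycles = sum(n * CPI[kind] for kind, n in counts_millions.items()) * 1e6
    return {
        "instructions": instructions,
        "cycles": cycles,
        "avg_cpi": cycles / instructions,
        "time_ms": cycles / CLOCK_RATE_HZ * 1e3,
    }


results = {name: summarize(counts) for name, counts in COUNTS_MILLIONS.items()}
for name, r in results.items():
    print(f"{name}: {r['instructions'] / 1e6:.0f} M instructions, "
          f"average CPI {r['avg_cpi']:.2f}, {r['time_ms']:.2f} ms")
```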

4. MIPS rating = Million Instructions Per Second. This can be calculated from
• CPI and clock rate of machine

$$\text{MIPS} = \frac{\text{Clock Rate}}{\text{CPI} \times 10^{6}}$$

• Total Execution Time and Instruction Count

$$\text{MIPS} = \frac{\text{Instruction Count}}{\text{Total Execution Time} \times 10^{6}}$$

MIPS rating for I1 = 1224 MIPS, for I2 = 1700 MIPS.
MIPS rating for I2 > MIPS rating for I1. This is as expected, since I2 has the shorter execution time.
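Both ways of computing the MIPS rating should agree. The sketch below checks this for I1 and I2 using the average CPI, instruction counts and cycle counts worked out in the solution; the function names are hypothetical.

```python
CLOCK_RATE_HZ = 3.06e9  # 3.06 GHz


def mips_from_cpi(clock_rate_hz, avg_cpi):
    """MIPS = clock rate / (CPI * 10^6)."""
    return clock_rate_hz / (avg_cpi * 1e6)


def mips_from_time(instruction_count, exec_time_s):
    """MIPS = instruction count / (execution time * 10^6)."""
    return instruction_count / (exec_time_s * 1e6)


# I1: 4 M instructions, 10 M cycles. I2: 5 M instructions, 9 M cycles.
mips_i1_a = mips_from_cpi(CLOCK_RATE_HZ, 10 / 4)
mips_i1_b = mips_from_time(4e6, 10e6 / CLOCK_RATE_HZ)
mips_i2_a = mips_from_cpi(CLOCK_RATE_HZ, 9 / 5)
mips_i2_b = mips_from_time(5e6, 9e6 / CLOCK_RATE_HZ)

print(f"I1: {mips_i1_a:.0f} MIPS (from CPI), {mips_i1_b:.0f} MIPS (from time)")
print(f"I2: {mips_i2_a:.0f} MIPS (from CPI), {mips_i2_b:.0f} MIPS (from time)")
```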

Probable Conclusions
1. Total Number of instructions is definitely not a good metric.
2. MIPS is a good metric.

Numerical Assignment
A computer (3.06 GHz) has the following CPI:

| Instruction Type | CPI |
|---|---|
| A | 5 |
| B | 2 |
| C | 3 |

An algorithm may be implemented in 2 ways, I1 and I2. For each implementation, the number of instructions used (in millions) is as follows:

| Instruction Type | I1 | I2 |
|---|---|---|
| A | 0 | 1 |
| B | 2 | 2 |
| C | 2 | 0 |

1. Which implementation has the smaller number of instructions?
2. What is the average CPI for both implementations? Which implementation is faster?
3. What is the total time taken for executing I1 and I2?
4. What can you say about the MIPS rating?

1. No. of Instructions: I1 = 4 M, I2 = 3 M. Hence the number of instructions for I1 is greater than the number of instructions for I2.
2. Clocks required by I1 = 2*2 + 2*3 = 10 M; by I2 = 1*5 + 2*2 = 9 M.
   Average CPI for I1 = 10/4 = 2.5; for I2 = 9/3 = 3.
   I2 is faster as it requires fewer clock cycles.
3. Total Time for I1 = 10 M / 3.06 GHz = 3.27 ms; for I2 = 9 M / 3.06 GHz = 2.94 ms.

4. MIPS rating for I1 = 1224 MIPS, for I2 = 1020 MIPS.
MIPS rating for I1 > MIPS rating for I2. This is unexpected, since I2 has the shorter execution time.

Conclusion
MIPS is also not a good metric for overall system performance.
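A short check of the second assignment shows the disagreement directly: ranking by execution time and ranking by MIPS pick different winners. The data is from the assignment; the helper name `time_and_mips` is an illustrative assumption.

```python
CLOCK_RATE_HZ = 3.06e9  # 3.06 GHz

# Second assignment: type A now costs 5 clocks per instruction.
CPI = {"A": 5, "B": 2, "C": 3}
COUNTS_MILLIONS = {
    "I1": {"A": 0, "B": 2, "C": 2},
    "I2": {"A": 1, "B": 2, "C": 0},
}


def time_and_mips(counts_millions):
    """Return (execution time in seconds, MIPS rating) for one implementation."""
    instructions = sum(counts_millions.values()) * 1e6
    cycles = sum(n * CPI[kind] for kind, n in counts_millions.items()) * 1e6
    time_s = cycles / CLOCK_RATE_HZ
    return time_s, instructions / (time_s * 1e6)


time_i1, mips_i1 = time_and_mips(COUNTS_MILLIONS["I1"])
time_i2, mips_i2 = time_and_mips(COUNTS_MILLIONS["I2"])

faster_by_time = "I1" if time_i1 < time_i2 else "I2"
better_by_mips = "I1" if mips_i1 > mips_i2 else "I2"
print(f"Faster by execution time: {faster_by_time}")
print(f"Higher MIPS rating:       {better_by_mips}")
```

I2 finishes sooner even though its MIPS rating is lower, because it executes fewer (but more expensive) instructions.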

Conclusion
Total time of execution is always a better metric, as it sums up all factors and cannot be replaced by considering any one of the following alone:
1. MIPS
2. Total number of instructions
3. Clock Rate

Measuring Performance
Now that we know that performance is dependent upon the program, which program(s) should be used to measure performance?
Benchmarks.

Benchmarks
• Are a set of programs that are specifically chosen for measuring performance.
• Types of Benchmarks
  – Real Programs
  – Kernel
    • Extract the key feature from a program
  – Component
  – Synthetic
    • Dhrystone – Integer and String Arithmetic
    • Whetstone – Floating Point
  – I/O
  – Parallel

Challenges
1. Vendors may tinker with benchmarks to make them run better on their platform. At times this is permitted.
2. Concentrate only on computational power.
3. Give a data set rather than a single performance number.

Popular Benchmarks
• SPEC – Standard Performance Evaluation Corporation
  – Floating point
  – Integer
  – Web
  – Graphics
• TPC – Transaction Processing Performance Council
  – Web Server
  – Transaction Processing
  – Decision Support Systems
• BAPCo – Business Applications Performance Corporation
  – Popular business applications
• EEMBC – Embedded Microprocessor Benchmark Consortium
  – Embedded Applications

Statistical Summarization of Data
• For Response time metric: Arithmetic Mean
• For Throughput metric: Harmonic Mean or Geometric Mean. SPEC uses Geometric Mean.
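The three means are available in Python's standard `statistics` module (`geometric_mean` needs Python 3.8 or later). The sample measurements below are made-up values used only to show the calls.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Hypothetical measurements (assumed for illustration).
response_times_s = [2.0, 4.0, 8.0]   # per-benchmark response times
throughput_ratios = [1.0, 2.0, 4.0]  # per-benchmark rates relative to a reference

avg_response = mean(response_times_s)            # arithmetic mean for times
hm_throughput = harmonic_mean(throughput_ratios) # harmonic mean for rates
gm_throughput = geometric_mean(throughput_ratios)  # geometric mean for ratios

print(f"Arithmetic mean of response times: {avg_response:.3f} s")
print(f"Harmonic mean of throughput:       {hm_throughput:.3f}")
print(f"Geometric mean of throughput:      {gm_throughput:.3f}")
```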

Are Benchmarks enough?
Benchmarks give the overall performance. If one wants to optimize performance, it may be necessary to know about the instruction or section of the program where maximum time is being spent. Profilers do this job.

Profiling or Dynamic Program Analysis
• Program behavior is analysed as it is being run.
• Techniques used
  – Instruction Set Simulation
  – Hardware Interrupts
  – OS Hooks
  – Code instrumentation
• Examples: Intel VTune, gprof
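The slides name Intel VTune and gprof. As a stand-in that needs no extra tooling, the sketch below uses Python's built-in `cProfile` and `pstats` modules to find where a program spends its time. The workload functions are hypothetical and exist only to give the profiler something to measure.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Hypothetical hot spot: a plain Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_constant():
    """Hypothetical cheap function."""
    return 42


def workload():
    slow_sum_of_squares(200_000)
    fast_constant()


# Record function-level timing while the workload runs.
profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Render the five most expensive entries, sorted by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
profile_text = report.getvalue()
print(profile_text)
```

The report lists each function with its call count and time, so the section where most time is spent stands out.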

Simulation
• Difficult to build the system. Simulation is cost effective.
• Beneficial for learning/improving some aspect of architecture.
• Simulators available are:
  – Keil – Instruction Simulator
  – Little Man Computer Simulator – Simulator of a machine
  – Cacheprof – Cache Simulator
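To give a feel for what a simulator does, here is a toy direct-mapped cache model. It is not how Cacheprof or any of the listed tools work; the function name, cache geometry and address traces are all assumptions made for illustration.

```python
def simulate_direct_mapped(addresses, num_lines, block_size):
    """Count hits and misses for a direct-mapped cache on an address trace."""
    tags = [None] * num_lines  # one stored tag per cache line, None = empty
    hits = 0
    for addr in addresses:
        block = addr // block_size  # which memory block the address falls in
        index = block % num_lines   # the single line that block may occupy
        tag = block // num_lines    # distinguishes blocks sharing a line
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag       # miss: load the block, evicting the old one
    return hits, len(addresses) - hits


# Sequential 4-byte accesses: the first touch of each 16-byte block misses,
# the next three accesses to that block hit.
sequential = list(range(0, 64, 4))
seq_hits, seq_misses = simulate_direct_mapped(sequential, num_lines=4, block_size=16)

# Two addresses that map to the same line keep evicting each other.
conflicting = [0, 64, 0, 64]
conf_hits, conf_misses = simulate_direct_mapped(conflicting, num_lines=4, block_size=16)

print(f"sequential:  {seq_hits} hits, {seq_misses} misses")
print(f"conflicting: {conf_hits} hits, {conf_misses} misses")
```

Changing `num_lines` or `block_size` and rerunning is far cheaper than building hardware, which is the argument the slide makes for simulation.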

Moore’s Law (1965)
Moore's Law states that the number of transistors on a single chip at the same price will double every 18 to 24 months.
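The doubling rule is exponential growth, which can be sketched as a one-line function. The ten-year horizon is an arbitrary choice for illustration, and `growth_factor` is a hypothetical helper name.

```python
def growth_factor(elapsed_months, doubling_period_months):
    """Factor by which the transistor count grows if it doubles every period."""
    return 2.0 ** (elapsed_months / doubling_period_months)


ten_years = 120  # months
slow = growth_factor(ten_years, 24)  # doubling every 24 months
fast = growth_factor(ten_years, 18)  # doubling every 18 months
print(f"After 10 years: {slow:.0f}x (24-month doubling) to {fast:.0f}x (18-month doubling)")
```

The gap between the two ends of the 18 to 24 month range compounds quickly over a decade.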

Implication?
As more transistors are added to a chip of the same area, their speed increases, hence circuits become faster. Or clock rate increases.
Moore’s Law, in combination with various other factors like ILP (Instruction Level Parallelism), was responsible for major improvements for a long time.

Trends in Computing (Intel Processors)

| | Fastest processor reported in text (2003) | Current fastest processor (2008) |
|---|---|---|
| Intel® processor name | Pentium 4 | Intel® Core™ i7-965 Processor Extreme Edition |
| Processor speed | 3.20 GHz | 3.20 GHz |
| Processor primary level cache | 12KB + 8KB | 4x32KB |
| Processor secondary cache | 512 KB | 4x256KB Level 2 cache |
| Processor third level cache | 2 MB | Unified inclusive 8MB L3 |

Observations
Processor Speed or Clock Rate has not changed: 3.20 GHz in both 2003 and 2008!

Observations
The primary level cache has gone from 12KB + 8KB to 4x32KB. What is the 4?

The Answer
Multi Core Approach: more transistors are actually being used to pack more cores into a chip, rather than increasing clock speed.
Why?
1. Power Wall
2. Memory Wall
3. No more ILP.

Topics for further Study
• Papers
  – Performance papers
  – Memory Wall
• Software
  – Intel VTune or any other profiling tool
  – Little Man Computer Simulator or any other simulator apart from Keil

Amdahl’s Law

$$\text{Execution time after improvement} = \frac{\text{Execution time affected by improvement}}{\text{Amount of improvement}} + \text{Execution time unaffected by improvement}$$

What does this mean?
Even if we substantially increase the performance of any one component, it may not result in a substantial overall performance improvement.

If memory instructions account for 50% of the total time taken, and a new architecture reduces the time taken by memory instructions by 50%, what is the overall increase in performance?
Told = 100, Tnew = 25 + 50 = 75, Imp = 25%
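Amdahl's Law can be sketched as a small function and applied to this example. The slide's figure of 25 for the memory part corresponds to memory instructions taking half their original time, that is, an improvement factor of 2; that reading is assumed here. The function name is a hypothetical choice.

```python
def time_after_improvement(total_time, affected_fraction, improvement_factor):
    """Amdahl's Law: affected time / improvement + unaffected time."""
    affected = total_time * affected_fraction
    unaffected = total_time - affected
    return affected / improvement_factor + unaffected


t_old = 100.0
# Memory instructions take 50% of the time and become 2x faster (assumed reading).
t_new = time_after_improvement(t_old, affected_fraction=0.5, improvement_factor=2.0)
reduction_percent = (t_old - t_new) / t_old * 100

print(f"Told = {t_old}, Tnew = {t_new}, time reduced by {reduction_percent}%")
```

Halving the cost of half the program removes only a quarter of the total time.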

What is better?
a. 90% increase in perf. of instructions executing 20% of time.
b. 20% increase in perf. of instructions executing 90% of time.
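One way to settle the question is to apply Amdahl's Law to both options. The sketch assumes that an "X% increase in perf." means the affected instructions run (1 + X/100) times faster; under a different reading the numbers would change.

```python
def time_after_improvement(total_time, affected_fraction, improvement_factor):
    """Amdahl's Law: affected time / improvement + unaffected time."""
    affected = total_time * affected_fraction
    unaffected = total_time - affected
    return affected / improvement_factor + unaffected


t_old = 100.0
# Option a: instructions taking 20% of the time run 1.9x faster (assumed reading).
time_a = time_after_improvement(t_old, affected_fraction=0.2, improvement_factor=1.9)
# Option b: instructions taking 90% of the time run 1.2x faster (assumed reading).
time_b = time_after_improvement(t_old, affected_fraction=0.9, improvement_factor=1.2)

better = "a" if time_a < time_b else "b"
print(f"a: {time_a:.2f}, b: {time_b:.2f}, better option: {better}")
```

Under this assumption the modest improvement to the common case wins, which matches the earlier advice to make the common case faster.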