Design, Organization & Architecture
Computer Design
Computer Design is concerned with the logical design of the computer according to the specifications: which types of components will be required, how many will be needed, and how the basic digital circuitry will be established. This is the responsibility of the computer designer.
Computer Organization
Computer Organization is the field of interconnecting the designed devices, each of which may consist of many components, into a complete system.
Computer Architecture
Computer Architecture sits at the top of Computer Organization and Computer Design.
Computer Architecture is the study of the organizational and design issues of a computer system, and is of use to researchers, scientists, and anyone who wants to work as an architect of a computer system.
Computer Architecture
[Figure: layered view — Computer Architecture (CA) on top of Computer Organization (CO), which in turn sits on top of Computer Design (CD).]
Computer Architecture is …
Computer Architecture is …
“…the structure of a computer that a machine language programmer
must understand to write a correct (timing independent) program for that
machine.”
Computer Architecture is …
“…the interface between what technology can provide and what the
marketplace demands.”
Computer Architecture is …
“…the science and art of selecting and interconnecting hardware
components to create computers that meet functional, performance and
cost goals.”
Computer Architecture …
“…forms the bridge between application need and the capabilities of the
underlying technology.”
Computer Architecture
We cannot architect a new computer without defining performance, power and cost goals. The design process is all about understanding and making trade-offs.
What is our target market, and what applications will we be running?
The “best” architecture is a moving target:
The needs of the marketplace change
Fabrication technology characteristics shift
New technologies emerge
memory, packaging, compilers, languages, ...
Computer Architecture
Why study computer architecture?
After all, I just buy a computer whose architecture has already
been designed by smart people and that I cannot change
Reason #1: One day you may design computers
not very likely
Reason #2: You will care about software performance
Very likely
Only by understanding architecture can one design fast software
It’s not all about having faster clock rates
Viewing the computer as a “black box” is not enough
Reason #3: You will make decisions for hardware
purchases
Very likely
And these decisions will have to be justified to your boss
Reason #4: Some fundamental concepts of computer
science arise in computer architecture
For instance in systems: caching, paging
Computer System Components
All computers consist of five classic components:
Processor
Datapath
Control
Memory
I/O
Input devices
Output devices
[Figure: system layers — User, Application, Language subsystems/Utilities, Hardware organization.]
Instruction Set Architecture (ISA)
Instruction Set Architecture (ISA) defines/specifies all the components and operations of a computer that can be used by a programmer:
Memory Organization
Address Space - How many locations can be addressed
Addressability - How many bits per location can be accessed
Register Set
How many registers are there
What is the size of registers
What type of registers are there
Instruction Set
OPcodes
Data types
Addressing modes
Instruction Set Architecture (ISA)
ISA provides all the information needed by anyone who wants to write a program in machine language (or translate from a high-level language to machine language).
ISA is an interface between software and hardware.
Task of Computer Designer
Determine what attributes are important for a new
machine; and then design a machine to:
Maximize
Performance (while staying within cost constraints)
Minimize
Cost and power (while meeting performance goals)
Aspects of this task include:
Instruction Set Design
Functional Organization
Logic Design and Implementation
Microprocessors 1960s to date
[Figure: computer classes from the 1960s to date — mainframes, supercomputers, minicomputers/mini-supercomputers, workstations, servers, and PCs.]
Computing Markets
The evolution of the computer industry has led to three computing markets:
Desktop Computing
Server-specific Machines
Embedded Computers
Desktop Computing
Desktop-based machines/computers range from low-end systems to high-end workstations.
Server Specific Machines
With the advent of desktop computing the
importance of servers for reliable file and computing
services grew.
Emergence of web based structures made the
requirement of servers very strong.
Key performance questions for server-based machines include:
How many users can be handled at a particular instant of time?
How many programs can be handled per unit of time by the server?
Server-Specific Machines
Emphasis in server-based machines is on:
Availability
An important criterion, closely tied to reliability
MTTF (Mean Time To Fail)
– Lifespan of a device (should be high)
MTBF (Mean Time Between Failures)
– Time between two failures (should be high)
MTTR (Mean Time To Repair)
– Repair time after a failure (should be low)
Response Time
How quickly users get a response from the server at a particular instant of time
Scalability
Another key feature (memory, computing capacity, storage, I/O bandwidth of a server)
Throughput
Requests handled per unit of time (should be high)
Uptime Calculation
A website is monitored during 24 hours (which
translates to 86,400 seconds), and in that time frame
the website went down for 10 minutes (600
seconds). To define the uptime and downtime
percentages, we perform the following calculation:
Total number of seconds for which website was down:
Down time = 600 seconds
Total number of seconds for which website was monitored:
Total time = 86,400 seconds
Divide 600 by 86,400, which is 0.0069.
In percentages, this is 0.69% which corresponds to downtime
percentage.
The uptime percentage for this website would be:
Uptime = 100% minus 0.69% = 99.31%.
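The calculation above can be sketched in a few lines of Python (the function name is ours, for illustration only):

```python
def uptime_percentages(total_seconds, down_seconds):
    """Return (uptime %, downtime %) for a monitoring window."""
    downtime_pct = down_seconds / total_seconds * 100
    return 100 - downtime_pct, downtime_pct

# 24 hours of monitoring (86,400 s) with 10 minutes (600 s) of downtime.
up, down = uptime_percentages(86_400, 600)
print(f"uptime = {up:.2f}%, downtime = {down:.2f}%")  # uptime = 99.31%, downtime = 0.69%
```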
MTTF (Mean Time To Fail)
Mean Time To Fail (MTTF) is a very basic measure of
reliability used for non-repairable systems.
MTTF represents the length of time that an item is expected
to last in operation until it fails.
MTTF is commonly referred to as the lifetime of any
product or a device.
MTTF value is calculated by looking at a large number of the
same kind of items over an extended period of time and
seeing what is their mean time to failure.
When MTTF is used as the failure metric, repair of the asset is not an option.
MTTF (Mean Time To Fail)
Let’s assume that we tested three identical devices
until all of them failed.
The first device failed after eight hours,
The second one failed after ten hours, and
The third device failed after twelve hours.
Now MTTF of this device type would be:
(8 + 10 + 12) / 3 = 10 hours
The above example leads us to the conclusion that this particular type and model of device needs to be replaced, on average, every 10 hours.
MTTF is an important metric used to estimate the lifespan of devices that are not repairable.
A shorter MTTF means more frequent downtime and disruptions, i.e. a less reliable device.
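The MTTF computation above is just an arithmetic mean, as this minimal Python sketch shows (function name is ours):

```python
def mttf(hours_to_failure):
    """Mean Time To Fail: average lifetime of a set of non-repairable devices."""
    return sum(hours_to_failure) / len(hours_to_failure)

# Three identical devices failed after 8, 10 and 12 hours.
print(mttf([8, 10, 12]))  # 10.0
```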
MTBF (Mean Time Between Failure)
Mean Time Between Failures is a measure of the total uptime of the component(s) divided by the total number of failures.
The formula for calculating MTBF is:
MTBF = Total uptime / Number of failures
Failure Rate
Failure Rate is a simple calculation derived by taking the inverse of the Mean Time Between Failures:
Failure Rate = 1 / MTBF
MTTR (Mean Time To Repair)
Mean Time To Repair (MTTR) refers to the amount of
time required to repair a system and restore it to full
functionality.
Availability
Availability is an important metric used to assess the
performance of repairable systems.
Availability refers to the fraction of time for which the system remains available for service. There are different classifications of availability, including:
Instantaneous (or Point) Availability
Average Uptime Availability (or Mean Availability)
Steady State Availability
Inherent Availability
Achieved Availability
Operational Availability
The formula for calculating the Availability of a system is:
Availability = MTBF / (MTBF + MTTR)
Availability
Availability calculation example:
Total uptime = 8, Number of failures = 2 → MTBF = 8 / 2 = 4
Failure rate = 1 / MTBF = 0.25
Total downtime = 4, Number of failures = 2 → MTTR = 4 / 2 = 2
Availability = MTBF / (MTBF + MTTR) = 4 / (4 + 2) ≈ 0.67, i.e. about 67%
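The example above can be worked through in Python; the helper names are ours, chosen for readability:

```python
def mtbf(total_uptime, failures):
    """Mean Time Between Failures."""
    return total_uptime / failures

def mttr(total_downtime, failures):
    """Mean Time To Repair."""
    return total_downtime / failures

def availability(mtbf_value, mttr_value):
    # Fraction of time a repairable system is available for service.
    return mtbf_value / (mtbf_value + mttr_value)

m = mtbf(8, 2)   # 4.0
r = mttr(4, 2)   # 2.0
print(availability(m, r))  # ~0.667
print(1 / m)               # failure rate = 0.25
```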
Embedded Systems
An embedded system is a computer system designed
to perform one or a few dedicated functions, often
with real-time computing constraints.
Embedded systems control many of the common devices in use today, e.g. microwaves, washing machines, printers, network switches, cars, palmtops, cell phones, etc.
Some embedded computers are reprogrammable, but more commonly the program is loaded at manufacture and changed only on upgrade.
There is great diversity in embedded computers, from 8-bit microprocessors to 32-bit microprocessors capable of executing several million instructions per second, as used in network switches.
Embedded Systems
There are two important constraints in Embedded
computers:
Performance in real time
Minimum Cost
Performance Basics:
Execution Time vs Throughput
How do we measure which of two computers is faster?
Execution Time (response time, latency, …)
Time to complete one task
Throughput (bandwidth)
Tasks per unit of time (sec, hour, day, …)
How do we calculate that Machine X is n times faster than Machine Y?
Performance and Execution time are reciprocals of each other, so an increase in performance means a decrease in execution time:
Performance = 1 / Execution-Time
Execution Time
How does one define execution time?
sounds easy but...
The most common understanding is somewhat known as the
wall-clock time
Start a timer and launch a program at the same instant
When the program completes, stop the timer
The elapsed time is the wall-clock time
The wall-clock time includes all activities of the program
Another metric is CPU time
Wall-clock time NOT including time spent waiting for I/O, or time spent
waiting for the CPU to be free due to multiple processes running at the
same time
CPU time has two components:
User time: time spent in the program itself
System time: time spent in the operating systems
e.g., when calling system calls
Wall-clock time > User time + System time
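The distinction between wall-clock time and CPU time can be observed directly. A minimal Python sketch, using the standard timers as rough analogues of the measurements described above:

```python
import time

def busy_sum(n):
    # A purely compute-bound task: consumes CPU time.
    return sum(range(n))

wall_start = time.perf_counter()   # wall-clock (elapsed) timer
cpu_start = time.process_time()    # CPU time (user + system) of this process

busy_sum(1_000_000)
time.sleep(0.2)                    # waiting consumes wall-clock time, not CPU time

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start
# The sleep shows up in wall-clock time but (mostly) not in CPU time.
print(f"wall = {wall_elapsed:.3f}s, cpu = {cpu_elapsed:.3f}s")
```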
Execution Time Calculation: By Hand
One possibility would be to do this by just “looking”
at a clock, launching the program, “looking” at the
clock again when the program terminates
This of course has some drawbacks
Poor resolution
Requires the user’s attention
Only for wall-clock time
Therefore operating systems provide ways to time
programs automatically
UNIX/Linux provides the time command
The UNIX/Linux time Command
We can put time in front of any UNIX/Linux command we invoke
When the invoked command completes, time prints out timing (and
other) information, i.e.
% time ls /home/casanova/ -la -R
0.520u 1.570s 0:20.58 10.1% 0+0k 570+105io 0pf+0w
Linux time command example
[root@sajjad ~]# /usr/bin/time -v du /
Command being timed: "du /"
User time (seconds): 0.04
System time (seconds): 6.71
Percent of CPU this job got: 38%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:17.73
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 981
Voluntary context switches: 3869
Involuntary context switches: 53630
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Performance checking on a Dedicated
System
Measuring the performance of a code must be done on a
quiescent/unloaded machine
The machine only runs the standard OS processes
The machine must be dedicated
No other user can start a process
The user measuring the performance only runs the minimum
amount of processes
basically, a shell
Nevertheless, one should always present measurement results
as averages over several experiments
Because the (small) load imposed by the OS is not deterministic
Drawbacks of UNIX/Linux time
command
The time command has poor resolution
“Only” milliseconds
Sometimes one wants a higher precision, especially if
performance improvements are in the 1-2% range
time times the whole code
Sometimes one is only interested in timing some part of the
code
Sometimes one wants to compare the execution time of
different sections of the code
Calculating Execution Time of a code
in Java
ntp_gettime() (Internet RFC 1589)
Sort of like gettimeofday, but reports estimated error on time measurement
Not available for all systems
Part of the GNU C Library
Java: System.currentTimeMillis()
Known to have resolution problems — its actual resolution can be much coarser than 1 millisecond!
Solution: use a native interface to a better timer
Java: System.nanoTime()
Added in J2SE 5.0
Probably not accurate at the nanosecond level
Tons of “high precision timing in Java” on the Web
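For comparison, the same idea of timing one code section with a high-resolution monotonic timer can be sketched in Python, whose `time.perf_counter_ns()` plays roughly the role of Java's `System.nanoTime()` (the `time_section` helper is ours):

```python
import time

def time_section(fn, *args):
    """Time a single code section with a high-resolution monotonic timer."""
    start = time.perf_counter_ns()   # rough analogue of System.nanoTime()
    result = fn(*args)
    elapsed_ns = time.perf_counter_ns() - start
    return result, elapsed_ns

result, ns = time_section(sum, range(100_000))
print(f"sum took {ns} ns")
```

As with `nanoTime()`, nanosecond units do not imply nanosecond accuracy; the underlying clock resolution is platform-dependent.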
Levels and Types of Benchmarks
A benchmark is a test that measures the performance of hardware, software, or a complete computer system.
These tests can be used to compare how well a product performs against other products.
Based on the levels of performance they measure,
benchmarks can be grouped into two levels:
Component-Level Benchmarks
System-Level Benchmarks
Component-Level Benchmarks
Component-Level benchmarks test a specific component of a
computer system, such as the video board, the audio card, or
the microprocessor.
They are useful for selecting a component of a computer
system, that corresponds to a particular function.
Instead of testing the performance of the system running real
applications, component-level benchmarks focus on the
performance of subsystems within a system.
These subsystems may include arithmetic integer unit,
arithmetic floating-point unit, memory system, disk
subsystem, etc.
Examples of component-level benchmarks include:
SPECweb96 - measures the web server performance
GPC - measures graphics performance for displaying 3-D images
System-Level Benchmarks
System-Level benchmarks evaluate the overall performance of
a computer running real programs or applications.
These benchmarks are useful when comparing systems of
different architectures.
They take each subsystem into account, and indicate the
effect of each subsystem on the overall performance.
Examples of system-level benchmarks include:
SYSmark/NT 4.0 - measures the performance of computers running
popular business applications under Windows NT 4.0
TPC-C - measures the performance of transaction processing systems
Synthetic Benchmarks
Synthetic benchmarks are component-level benchmarks, and
they evaluate a particular capability of a subsystem. For
example, a disk subsystem performance benchmark may
combine a series of basic seek, read, and write operations
involving varying numbers of disk blocks of varying sizes.
When evaluating the results from synthetic benchmarks, the
following rules should be followed:
Understand the composition of the benchmark, and
Understand the factors contributing to the results.
Examples of synthetic benchmarks include:
WinBench 97 - measures the performance of a PC's graphics, disk,
processor, video and CD-ROM subsystems in Windows environment
MacBench 97 - measures the processor, floating-point, graphics, video,
and CD-ROM performance of a MAC OS system
Application Benchmarks
Application benchmarks employ actual application programs.
Most application benchmarks are system-level benchmarks,
and they measure the overall performance of a system.
When an application benchmark is run, it tests the
contribution of each component of the system to the overall
performance.
These benchmarks are usually larger and harder to set up and execute, and they are less useful for predicting future needs.
Examples of application benchmarks include:
Winstone 97 - tests a PC's overall performance when running Windows-
based 32-bit business applications
SYSmark/NT 4.0 - measures the performance of computers running
popular business applications under Windows NT 4.0
Benchmark Suites
Typically, many benchmarks are packaged together in “suites”
Most famous/successful benchmarks results are from SPEC
(Standard Performance Evaluation Corporation)
Cover different types of applications
www.spec.org
There are other benchmarks
Business Winstone
CC Winstone
Winbench
SPEC Benchmark - Report
Source:
http://www.spec.org/cloud_iaas2018/results/res2019q4/cloudiaas2018-20191113-00005.html
Interpreting Benchmark Results
Let’s say your boss asks you to buy a server, for hosting a Web server and a database server, for 8K.
You must find the server with the best performance for that price.
Now you know that you should look for
SPEC (Standard Performance Evaluation Corporation, www.spec.org) and
TPC (Transaction processing Performance Council www.tpc.org )
results in magazine or in documentation provided by the vendors
Question: how to interpret published benchmark results?
A few general principles
Pay attention to the software configuration
And in particular to OS
many benchmarks run in single-user mode or with “enhanced” OS configurations
Pay extreme attention to the compiler technology
Obtaining Benchmark Results
No matter what, people running benchmarks can always
Play unforeseen tricks
Fail to report on the full system information
For this reason, benchmarks come with different requirements
No source code modification allowed (SPEC)
just pay attention to the software configuration and the compiler
Source code modifications allowed, but are difficult (TPC-C)
use of proprietary software, OS difficult to modify
Source code modifications allowed
supercomputer benchmarks
EEMBC (so-called “optimized results”)
Hand-coding allowed (EEMBC)
allows assembly code to be written (can speed up code by up to 80x)
Summarizing Benchmark Results
Benchmark results are often misleading
Incomplete information on how the results were obtained
Underhanded tactics by companies and designers
To make things worse, benchmark suites contain
multiple programs, so what is reported is the
performance of a system on a variety of programs
Question: how do we summarize those results?
Measuring and Reporting Performance
For a
Desktop computer user
A computer is faster when a program runs in less time.
Large server manager
A computer is faster when more jobs are completed in less time.
– In this environment, for an individual user, lower response time is the criterion of performance.
Large data-processing environment/system
The overall system is judged by its throughput, i.e.
– The total amount of work done in a given time.
Measuring and Reporting Performance
So we can define these metrics in mathematical terminology as:
Performance = 1 / Execution-Time
As Execution time is the reciprocal of Performance, “Machine A is n times faster than Machine B” can be written as:
n = Performance of Machine A / Performance of Machine B
  = (1 / Execution time of A) / (1 / Execution time of B)
  = Execution time of B / Execution time of A
The CPU Performance Equation
Following is the CPU performance equation:
CPU Time = CPU clock cycles for a program × Clock cycle time
Expanding the clock cycles per program into instruction count and clock cycles per instruction:
CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
         = Seconds / Program
CPU Performance & Time
So it illustrates that the CPU performance and CPU
time is dependent on three characteristics:
clock cycle time (or rate)
clock Cycles Per Instruction (CPI)
Instruction Count (IC)
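The three characteristics combine multiplicatively, which a short Python sketch makes concrete (the function name and the example numbers are ours, for illustration):

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = IC x CPI / clock rate (equivalently IC x CPI x cycle time)."""
    return instruction_count * cpi / clock_rate_hz

# Hypothetical program: 10^9 instructions, average CPI of 2, 1 GHz clock.
print(cpu_time(1_000_000_000, 2.0, 1_000_000_000))  # 2.0 seconds
```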
CPU Performance & Time
n = Performance of Machine A / Performance of Machine B
  = (1 / Execution time of A) / (1 / Execution time of B)
  = Execution time of B / Execution time of A
For example, if a program takes 10 seconds on Machine A and 15 seconds on Machine B:
n = 15 / 10 = 1.5
i.e. Machine A is 1.5 times faster than Machine B.
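The ratio above is easily computed; a minimal sketch (function name is ours):

```python
def times_faster(exec_time_a, exec_time_b):
    """How many times faster machine A is than machine B.
    n = Perf(A) / Perf(B) = ExecTime(B) / ExecTime(A)."""
    return exec_time_b / exec_time_a

print(times_faster(10, 15))  # 1.5
```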
Quantitative Principles of Computer
Design
After defining, measuring and summarizing performance, we’ll
explore some principles that are useful in Design and Analysis of
Computers.
So, the answer to the question that “What are the main principles
people use for designing computers in order to gain high
performance” is to use the persistent rule in Computer Design, i.e.
“make the common case fast”,
i.e. favor the frequent case over the infrequent one.
When adding two numbers in CPU, we can expect overflow to be a
rare case and can therefore improve performance by optimizing the
more common case of no overflow.
Performance can decrease, in case of overflow, but it is rare, i.e.
not the common case.
So, the overall performance will be improved by optimizing for the normal case, i.e. the common case. Often the common case is also the simplest.
Amdahl’s Law
In order to apply make the common case fast, for
performance enhancement, we’ll have to decide:
What the frequent case is and
How much performance can be improved by making that case
faster.
A fundamental law, called Amdahl’s Law, can be used
to quantify this principle.
Amdahl’s Law
Amdahl’s Law defines the speedup that can be gained by
using a particular feature.
Speedup is the enhancement that we can make in a machine that
will improve performance when it is used.
Speedup = Performance for entire task using the enhancement when possible / Performance for entire task without using the enhancement
Alternatively,
Speedup = Execution time for entire task without using the enhancement / Execution time for entire task using the enhancement when possible
Speedup tells us how much faster a task will run using the
machine with the enhancement as opposed to the original
machine.
Amdahl’s Law gives us a quick way to find the speedup from
some enhancement, which depends on two factors:
The fraction of the computation time in the original machine that
can be converted to take advantage of the enhancement
The improvement gained by the enhanced execution mode
Amdahl’s Law
The fraction of the computation time in the original machine that can be converted to take advantage of the enhancement, i.e. if 20 sec. of the execution time of a program that takes 60 sec. in total can use an enhancement, the fraction is 20/60. This value, which we will call Fraction_enhanced, is always <= 1.
The improvement gained by the enhanced execution mode, i.e. how much faster the task would run if the enhanced mode were used for the entire program. This value is the time of the original mode over the time of the enhanced mode. If the enhanced mode takes 2 sec. for some portion of the program that can completely use the mode, while the original mode took 5 sec. for the same portion, then the improvement is 5/2. We'll call this value Speedup_enhanced, which is always > 1.
[Figure: the original execution time of the task vs. the execution time after a fraction F of it has been enhanced by a factor S.]
Amdahl’s Law
Execution time using the original machine with the enhanced mode will be
the time spent using the unenhanced portion of the machine plus the time
spent using the enhancement.
Execution-Time_new = Execution-Time_old × ((1 – Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Speedup_overall = Execution-Time_old / Execution-Time_new
                = 1 / ((1 – Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
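Amdahl's Law is a one-line function; this minimal Python sketch (function name is ours) is used against the worked examples that follow:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup from enhancing a fraction of the execution time."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# 40% of the time sped up 10x (as in Example #01):
print(round(amdahl_speedup(0.4, 10), 2))  # 1.56
```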
Example # 01
Suppose we are considering an enhancement to the processor
of a server system used for web serving. The new CPU is 10
times faster on computation in the web serving applications
than the original processor. Assuming that the original CPU is
busy with computations 40% of the time and is waiting for
I/O 60% of the time. What is the overall speedup gained by
incorporating the enhancement?
Answer
Fractionenhanced = 0.4
Speedupenhanced = 10
Speedup_overall = 1 / ((1 – 0.4) + 0.4/10) = 1 / (0.6 + 0.04) = 1 / 0.64 = 1.56
Example # 02
A common transformation required in graphics engines is square root.
Implementations of floating-point square root vary significantly in performance,
especially among processors designed for graphics. Suppose FP square root
(FPSQR) is responsible for 20% of the execution time of a critical graphics
benchmark. One proposal is to enhance the FPSQR hardware and speed up this
operation by a factor of 10. The other alternative is just to try to make all FP
instructions in the graphics processor run faster by a factor of 1.6; FP instructions
are responsible for a total of 50% of the execution time for the application. The
design team believes that they can make all FP instructions run 1.6 times faster
with the same effort required for the fast square root. Compare these two design
alternatives.
These two alternatives can be compared using the Speedup comparisons:
Speedup_FPSQR = 1 / ((1 – 0.2) + 0.2/10) = 1 / (0.8 + 0.02) = 1 / 0.82 = 1.22
Speedup_FP = 1 / ((1 – 0.5) + 0.5/1.6) = 1 / (0.5 + 0.3125) = 1 / 0.8125 = 1.23
Improving all FP instructions is slightly better overall.
Example # 03
Suppose we can improve Load/Store instructions by 5x, i.e. 5
times speedup, and we further suppose that 10% of the
instructions are Load/Store?
What is the Overall Speedup obtained in this scenario?
Answer
Fractionenhanced = 0.1
Speedupenhanced = 5
Speedup_overall = 1 / ((1 – 0.1) + 0.1/5) = 1 / (0.9 + 0.02) = 1 / 0.92 = 1.086
The greater the fraction enhanced, the greater the effect on overall performance; this is the essence of Amdahl's Law.
Example # 04
Suppose we can improve Load/Store instructions by 5x, i.e. 5
times speedup, and we further suppose that 50% of the
instructions are Load/Store?
What is the Overall Speedup obtained in this scenario?
Answer
Fractionenhanced = 0.5
Speedupenhanced = 5
Speedup_overall = 1 / ((1 – 0.5) + 0.5/5) = 1 / (0.5 + 0.1) = 1 / 0.6 = 1.67
Example # 05
Suppose we can improve Load/Store instructions by 20x, i.e.
20 times speedup, and we further suppose that 10% of the
instructions are Load/Store?
What is the Overall Speedup obtained in this scenario?
Answer
Fractionenhanced = 0.1
Speedupenhanced = 20
Speedup_overall = 1 / ((1 – 0.1) + 0.1/20) = 1 / (0.9 + 0.005) = 1 / 0.905 ≈ 1.10
Example # 06
Calculate overall speedup if we make 90% of a program run
10 times faster?
Answer
Fractionenhanced = 0.9
Speedupenhanced = 10
Speedup_overall = 1 / ((1 – 0.9) + 0.9/10) = 1 / (0.1 + 0.09) = 1 / 0.19 = 5.26
Example # 07
Calculate overall speedup if we make 80% of a program run
20 times faster?
Answer
Fractionenhanced = 0.8
Speedupenhanced = 20
Speedup_overall = 1 / ((1 – 0.8) + 0.8/20) = 1 / (0.2 + 0.04) = 1 / 0.24 ≈ 4.17
Example # 08
What fraction of the computations should be able to use the floating-point processor (whose speedup is 15) in order to achieve an overall speedup of 2.25?
Answer
Fraction_enhanced = ?
Speedup_enhanced = 15
2.25 = 1 / ((1 – F) + F/15) = 15 / (15 – 15F + F) = 15 / (15 – 14F)
2.25 (15 – 14F) = 15
33.75 – 31.5F = 15
31.5F = 18.75
F = 18.75 / 31.5 = 0.595, i.e. about 60%
Fraction_enhanced = 60%
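Solving Amdahl's Law for the fraction, as in this example, gives a closed form: F = (1 − 1/S_overall) / (1 − 1/S_enhanced). A minimal sketch (function name is ours):

```python
def required_fraction(overall_speedup, speedup_enhanced):
    """Fraction that must be enhanceable to reach a target overall speedup.
    Solves S_overall = 1 / ((1 - F) + F / S_enhanced) for F."""
    return (1 - 1 / overall_speedup) / (1 - 1 / speedup_enhanced)

print(round(required_fraction(2.25, 15), 3))  # 0.595
```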
Example # 09
We have a system that contains a special processor for doing
floating-point operations. We have determined that 50% of
our computations can use the floating-point processor. The
speedup of the floating pointing-point processor is 15.
What is the overall speedup achieved by using the floating-point processor?
Answer
Fractionenhanced = 0.5 AND Speedupenhanced = 15
Speedup_overall = 1 / ((1 – 0.5) + 0.5/15) = 1 / (0.5 + 0.033) = 1 / 0.533 = 1.875
What is the overall speedup achieved if we modify the compiler so that 75% of the
computations can use the floating-point processor?
Answer
Fractionenhanced = 0.75 AND Speedupenhanced = 15
Speedup_overall = 1 / ((1 – 0.75) + 0.75/15) = 1 / (0.25 + 0.05) = 1 / 0.3 = 3.33
Optimization (Improving Performance)
Once a machine's hardware is fixed, further optimization mostly depends on the software side, i.e. on how well programs exploit the hardware, and this is what ultimately determines performance.
Some ways of optimization are:
Locality
Temporal Locality
Spatial Locality
Parallelism
Parallelism at System Level
Parallelism at the level of an Individual Processor (Pipelining)
Properties to Exploit in Computer
Design: Locality
Although Amdahl's Law applies to any system, other important fundamental observations come from properties of programs, and the most important program property that we regularly exploit is the Principle of Locality.
Programs tend to reuse data and instructions they have used recently. A widely held rule of thumb is that a program spends 90% of its execution time in only 10% of the code.
An implication of locality is that we can predict with reasonable accuracy which instructions and data a program will use in the near future, based on its accesses in the recent past.
Temporal Locality states that recently accessed items are likely to be accessed again in the near future.
Spatial Locality states that items whose addresses are near one another tend to be referenced close together in time.
Principles of Locality
Locality is important because it allows us to predict
what the processor will do next with reasonable
accuracy based on what it did before
Temporal locality: having accessed a location, this location is likely to be accessed again
Therefore, if one can keep recently accessed data items
“close” to the processor, then perhaps the next instructions
will find them ready for use.
Spatial locality: having accessed a location, a nearby
location is likely to be accessed next
Therefore, if one can bring in contiguous data items “close”
to the processor at once, then perhaps a sequence of
instructions will find them ready for use.
Locality: Example
Consider the following sequence of addressed
references by the CPU
1,2,1200,1,1200,3,4,5,6,1200,7,8,9,10
This sequence has “good Temporal Locality”, as location 1200 is referenced three times (and location 1 twice) out of 14 references
This sequence has “good Spatial Locality”, as locations [1,2], [3,4,5,6] and [7,8,9,10] are addressed in sequence
Like the elements of an array that are accessed one after the
other
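Both kinds of locality in this trace can be counted mechanically; a minimal Python sketch, with helper names of our own choosing (here "spatial" is approximated as consecutive references to adjacent addresses):

```python
def temporal_reuse(trace):
    """Count references to addresses already seen earlier (temporal locality)."""
    seen, reuses = set(), 0
    for addr in trace:
        if addr in seen:
            reuses += 1
        seen.add(addr)
    return reuses

def sequential_pairs(trace):
    """Count consecutive references to adjacent addresses (spatial locality)."""
    return sum(1 for a, b in zip(trace, trace[1:]) if b == a + 1)

trace = [1, 2, 1200, 1, 1200, 3, 4, 5, 6, 1200, 7, 8, 9, 10]
print(temporal_reuse(trace))    # 3 repeat references (1 once, 1200 twice)
print(sequential_pairs(trace))  # 7 adjacent-address pairs
```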
Properties to Exploit in Computer
Design: Parallelism
One of the most important methods of improving performance is to take advantage of parallelism.
Parallelism refers to an act of performing two or
more tasks or operations simultaneously.
Parallelism operates at two levels:
Hardware
Software
Types of Parallelism
Parallelism is broadly classified into two types:
Parallelism in Hardware
Parallelism in Uniprocessor environments
– Pipelining
– Superscalar, VLIW etc
Vector Processors, GPUs and SIMD Instructions etc
Parallelism in Multiprocessor environments
– Symmetric shared-memory multiprocessors
– Distributed-memory multiprocessors
– Chip multiprocessors, e.g. multicore architectures
Parallelism in Software
Instruction Level Parallelism (ILP)
Task Level Parallelism (TLP)
Data Parallelism
Transaction Level Parallelism
Pipelining
In Pipelining, the Processor is divided into functional units.
Division of a processor into 5 functional units is given below:
Instruction Fetch IF
Instruction Decode ID
Instruction Execute IE
Memory Access MA
Write Back WB
Write-back
– Memory is updated only when a modified block is replaced.
Write-through
– Memory is updated immediately on every write.
(Note: write-back and write-through are cache update policies; the WB pipeline stage above writes an instruction's result back to the register file.)
Without Pipelining
[Animation frames omitted. Each instruction passes through IF, ID, IE, MA and WB one stage at a time, and the next instruction cannot enter IF until the previous instruction has completed WB. The 1st instruction completes in the 5th cycle, so each instruction takes 5 cycles and 5 instructions take 5 × 5 = 25 cycles.]
Pipelining
With pipelining, a new instruction enters the pipeline every cycle,
so the five stages work on up to five different instructions at once:

Cycle:    1   2   3   4   5   6   7   8   9
Instr 1:  IF  ID  IE  MA  WB
Instr 2:      IF  ID  IE  MA  WB
Instr 3:          IF  ID  IE  MA  WB
Instr 4:              IF  ID  IE  MA  WB
Instr 5:                  IF  ID  IE  MA  WB

The 1st instruction still completes in the 5th cycle, but with
pipelining it took only 9 cycles to complete 5 instructions.
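The cycle counts in the two diagrams follow a simple pattern, sketched below for an ideal k-stage pipeline with one instruction issued per cycle and no stalls (hazards and stalls, which lengthen real pipelines, are ignored here).

```python
# Ideal cycle counts for n instructions on a k-stage processor.
def cycles_without_pipelining(n, k=5):
    return n * k                      # each instruction runs all k stages alone

def cycles_with_pipelining(n, k=5):
    return k + (n - 1)                # k cycles to fill, then one finishes per cycle

print(cycles_without_pipelining(5))   # → 25
print(cycles_with_pipelining(5))      # → 9
```

For large n the speedup approaches k, which is why deeper pipelines were long a favored way to raise clock-rate-normalized throughput.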
Pipelining
The concept of pipelining is used in a uniprocessor
environment.
Pipelining makes computing and processing logic
considerably more complex.
Pipelining is not used in a multiprocessing environment,
as duplicated hardware is already available there, i.e. a
larger number of processors.
120
Computer Design Cycle
Computer design and development have been under the influence of:
Technology
Performance
Cost
New trends in technology have been used to reduce cost and
enhance performance.
The decisive factors behind the rapid changes in computer development
have been performance enhancement, price reduction and functional
improvement.
121
Computer Design Cycle: Performance
A computer design is evaluated for bottlenecks
using certain benchmarks in order to achieve optimum
performance.
[Diagram: existing systems are evaluated for bottlenecks using
benchmarks; the results feed into performance, technology and
cost decisions.]
122
Computer Design Cycle: Performance
Performance metrics are:
Time/Latency
Time is the key measure of performance:
A desktop user may define the performance of his/her machine in
terms of time taken by the machine to execute a program; whereas
A computer center manager running a large server system may
define the performance in terms of the number of jobs completed
in a specified time.
Throughput
The desktop user is interested in reducing the response time or
execution time, i.e. time between the start and completion of an
event; while
A data processing center is interested in increasing the throughput,
i.e. the number of jobs completed per unit time
Other measures, such as MIPS, MFLOPS, clock
frequency (MHz) or cache size, are not meaningful measures of
performance on their own.
123
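The latency/throughput distinction above can be made concrete with a worked example. The job count and elapsed time are illustrative assumptions; note that latency is the inverse of throughput only in the fully sequential case sketched here.

```python
# Latency vs. throughput for a sequential batch of jobs.
jobs = 120                 # jobs completed
elapsed_seconds = 60.0     # wall-clock time for the whole batch

throughput = jobs / elapsed_seconds   # jobs per second (data-center view)
avg_latency = elapsed_seconds / jobs  # seconds per job (desktop-user view)

print(f"throughput  = {throughput} jobs/s")   # → 2.0 jobs/s
print(f"avg latency = {avg_latency} s/job")   # → 0.5 s/job
```

When jobs overlap (as on a pipelined or parallel system), throughput can rise without any single job's latency improving, which is exactly why the two user groups in the slide measure performance differently.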
Performance Measuring Tools
Benchmarks:
Benchmarks are programs that try to capture the average frequency
of operations and operands of a large set of real programs.
Standardized benchmark suites are available from the Standard
Performance Evaluation Corporation – SPEC at www.spec.org
Hardware:
Cost, Delay, Area, Power consumption
Simulation at different levels of design abstraction, i.e.
ISA, RT, Gate, Circuit
Queuing Theory
To calculate the response time and throughput of the entire I/O system
Rules of Thumb
Fundamental “Laws”/Principles
124
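The queuing-theory use mentioned above can be sketched with Little's Law, N = X · R (average jobs in the system equals throughput times response time), a standard result for reasoning about I/O systems. The numeric values below are illustrative assumptions.

```python
# Little's Law applied to an I/O subsystem: N = X * R.
arrival_rate = 40.0        # requests per second entering the system (X)
response_time = 0.25       # average seconds per request (R)

in_system = arrival_rate * response_time   # average requests in flight (N)
print(in_system)                           # → 10.0
```

Read the other way around, measuring the average number of requests in flight and the arrival rate lets a designer infer the response time without timing individual requests.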
Computer Design Cycle: Technology
Technology trends motivate new designs.
The new designs are simulated to evaluate their performance under
different levels of workload.
Simulation helps keep the cost of verifying results to a minimum.
[Diagram: technology trends feed the simulation of new designs and
organizations against different workloads; existing systems are
evaluated for bottlenecks using benchmarks.]
128
Technology Trends: Processor Performance
In 1980, the performance of supercomputers was 50 times that of
microprocessors, 20 times that of minicomputers and 10 times that
of mainframe computers.
[Diagram: performance, implementation complexity and cost determine
the technology used to implement the next-generation system.]
130
Cost, Price and their Trends
Although there are some computer designs, such as
supercomputers, where cost matters little, cost-sensitive
designs are of growing significance.
For the last 15-20 years, the use of technology
improvements to achieve lower cost, as well as
increased performance, has been a major theme in
the computer industry.
We know that
Price is what we sell a finished good for, and
Cost is the amount spent to produce it, including overhead.
Understanding of cost and its factors is essential for
designers to be able to make intelligent decisions
about whether or not a new feature should be
included in designs where cost is an issue.
(Imagine an architect designing a skyscraper without any information on the costs of steel, concrete, etc.) 131
Price versus Cost
The relationship between cost and price is complex.
Cost refers to
the total amount spent to produce a product.
132
Price versus Cost
Manufacturing Cost is
the total amount spent to produce a component.
List Price is the amount for which the finished good is sold; it includes an
average discount of 15% to 35% in the form of volume discounts and/or retailer markup. 133
Chip Manufacturing Process & Stages
Silicon and the silicon chip made it possible to build LSI, VLSI and ULSI chips.
Silicon itself is obtained from sand, and is initially formed into a cylindrical ingot.
Silicon ingot → Slicer → Blank wafers → 20 to 40
processing steps → Tester → Part → Ship to customers
Cost of an IC = (die cost + die testing cost + packaging cost + final testing cost) / final test yield
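The IC cost formula can be turned into a small calculation. All of the input values below are illustrative assumptions, not real process data.

```python
# IC cost: sum the per-part costs, then divide by the final test yield,
# since parts that fail final test must be paid for by the good ones.
def ic_cost(die_cost, die_test_cost, packaging_cost,
            final_test_cost, final_test_yield):
    return (die_cost + die_test_cost + packaging_cost
            + final_test_cost) / final_test_yield

# E.g. a $10 die, $1 die test, $2 package, $1 final test, 50% final yield:
print(ic_cost(10.0, 1.0, 2.0, 1.0, 0.5))   # → 28.0
```

Note how sensitive the result is to yield: halving the final test yield doubles the cost of every good part shipped.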