Design, Organization & Architecture
Computer Design
Computer Design is concerned with the logical design of the computer according to the specifications: which types of components will be required, how many will be needed, and how the basic digital circuitry will be established. This is the responsibility of the computer designer.
Computer Organization
Computer Organization is the field of interconnecting the designed devices, each of which may consist of many components, into a complete system.
Computer Architecture
Computer Architecture sits at the top of Computer Organization and Computer Design.
Computer Architecture is the study of the organizational and design issues of a computer system, and is of use to researchers, scientists, and anyone who wants to work as an architect of a computer system.
Computer Architecture
[Figure: layered view — Computer Architecture (CA) on top of Computer Organization (CO), which in turn sits on top of Computer Design (CD).]
Computer Architecture is …
Computer Architecture is …
“…the structure of a computer that a machine language programmer
must understand to write a correct (timing independent) program for that
machine.”
Computer Architecture is …
“…the interface between what technology can provide and what the
marketplace demands.”
Computer Architecture is …
“…the science and art of selecting and interconnecting hardware
components to create computers that meet functional, performance and
cost goals.”
Computer Architecture …
“…forms the bridge between application need and the capabilities of the
underlying technology.”
Computer Architecture
We cannot architect a new computer without defining performance, power and cost goals. The design process is all about understanding and making trade-offs.
What is our target market, and what applications will we be running?
The “best” architecture is a moving target:
The needs of the marketplace change
Fabrication technology characteristics shift
New technologies emerge
memory, packaging, compilers, languages, ...
Computer Architecture
Why study computer architecture?
After all, I just buy a computer whose architecture has already
been designed by smart people and that I cannot change
Reason #1: One day you may design computers
not very likely
Reason #2: You will care about software performance
Very likely
Only by understanding architecture can one design fast software
It’s not all about having faster clock rates
Viewing the computer as a “black box” is not enough
Reason #3: You will make decisions for hardware
purchases
Very likely
And these decisions will have to be justified to your boss
Reason #4: Some fundamental concepts of computer
science arise in computer architecture
For instance in systems: caching, paging
Computer System Components
All computers consist of five classic components:
Processor
Datapath
Control
Memory
I/O
Input devices
Output devices
[Figure: system layers — User, Application, Language subsystems/Utilities, Hardware organization.]
Instruction Set Architecture (ISA)
Instruction Set Architecture (ISA) defines/specifies all the components and operations of a computer that can be used by a programmer:
Memory Organization
Address Space - How many locations can be addressed
Addressability - How many bits per location can be accessed
Register Set
How many registers are there
What is the size of registers
What type of registers are there
Instruction Set
OPcodes
Data types
Addressing modes
Instruction Set Architecture (ISA)
ISA provides all the information needed by anyone who wants to write a program in machine language (or translate from a high-level language to machine language).
ISA is an interface between software and hardware.
Task of Computer Designer
Determine what attributes are important for a new
machine; and then design a machine to:
Maximize
Performance (while staying within cost constraints)
Minimize
Cost and power (while meeting performance goals)
Aspects of this task include:
Instruction Set Design
Functional Organization
Logic Design and Implementation
Microprocessors 1960s to date
[Figure: computer classes from the 1960s to date — mainframes, supercomputers, minicomputers/mini-supercomputers, workstations, servers, and PCs.]
Computing Markets
The evolution of the computer industry has led to three computing markets:
Desktop Computing
Server-specific Machines
Embedded Computers
Desktop Computing
Desktop-based machines/computers range from low-end systems to high-end workstations.
Server Specific Machines
With the advent of desktop computing the
importance of servers for reliable file and computing
services grew.
Emergence of web based structures made the
requirement of servers very strong.
Key performance questions for server-based machines include:
How many users can be handled at a particular instant of time?
How many programs can be handled per unit of time by the server?
Server-Specific Machines
Emphasis in server-based machines is on:
Availability
An important criterion, closely tied to reliability
MTTF (Mean Time To Fail)
– Lifespan of a device (should be high)
MTBF (Mean Time Between Failures)
– Time between two failures (should be high)
MTTR (Mean Time To Repair)
– Repair time after a failure (should be low)
Response Time
How quickly users get a response from the server at a particular instant of time
Scalability
Another key feature (memory, computing capacity, storage, I/O bandwidth of a server)
Throughput
Requests handled per unit of time (should be high)
Uptime Calculation
A website is monitored during 24 hours (which
translates to 86,400 seconds), and in that time frame
the website went down for 10 minutes (600
seconds). To define the uptime and downtime
percentages, we perform the following calculation:
Total number of seconds for which website was down:
Down time = 600 seconds
Total number of seconds for which website was monitored:
Total time = 86,400 seconds
Divide 600 by 86,400, which is 0.0069.
In percentages, this is 0.69% which corresponds to downtime
percentage.
The uptime percentage for this website would be:
Uptime = 100% minus 0.69% = 99.31%.
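The calculation above can be sketched in a few lines of Python (the function name is ours, for illustration only):

```python
def uptime_percentages(total_seconds, down_seconds):
    """Return (uptime %, downtime %) for a monitoring window."""
    downtime_pct = down_seconds / total_seconds * 100
    return 100 - downtime_pct, downtime_pct

# 24 hours of monitoring (86,400 s) with 10 minutes (600 s) of downtime.
up, down = uptime_percentages(86_400, 600)
print(f"uptime = {up:.2f}%, downtime = {down:.2f}%")  # uptime = 99.31%, downtime = 0.69%
```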
MTTF (Mean Time To Fail)
Mean Time To Fail (MTTF) is a very basic measure of
reliability used for non-repairable systems.
MTTF represents the length of time that an item is expected
to last in operation until it fails.
MTTF is commonly referred to as the lifetime of any
product or a device.
MTTF value is calculated by looking at a large number of the
same kind of items over an extended period of time and
seeing what is their mean time to failure.
When MTTF is used as the failure metric, repair of the asset is not an option.
MTTF (Mean Time To Fail)
Let’s assume that we tested three identical devices
until all of them failed.
The first device failed after eight hours,
The second one failed after ten hours, and
The third device failed after twelve hours.
Now MTTF of this device type would be:
(8 + 10 + 12) / 3 = 10 hours
The above example leads us to the conclusion that this particular type and model of device needs to be replaced, on average, every 10 hours.
MTTF is an important metric used to estimate the lifespan of devices that are not repairable.
A shorter MTTF means more frequent downtime and disruptions, i.e. a less reliable device.
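The MTTF computation above is just an arithmetic mean, as this minimal Python sketch shows (function name is ours):

```python
def mttf(hours_to_failure):
    """Mean Time To Fail: average lifetime of a set of non-repairable devices."""
    return sum(hours_to_failure) / len(hours_to_failure)

# Three identical devices failed after 8, 10 and 12 hours.
print(mttf([8, 10, 12]))  # 10.0
```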
MTBF (Mean Time Between Failure)
Mean Time Between Failures is a measure of the total uptime of the component(s) divided by the total number of failures.
The formula for calculating MTBF is:
MTBF = Total uptime / Number of failures
Failure Rate
Failure Rate is a simple calculation derived by taking the inverse of the Mean Time Between Failures:
Failure Rate = 1 / MTBF
MTTR (Mean Time To Repair)
Mean Time To Repair (MTTR) refers to the amount of
time required to repair a system and restore it to full
functionality.
Availability
Availability is an important metric used to assess the
performance of repairable systems.
Availability refers to the fraction of time for which the system remains available for service. There are different classifications of availability, including:
Instantaneous (or Point) Availability
Average Uptime Availability (or Mean Availability)
Steady State Availability
Inherent Availability
Achieved Availability
Operational Availability
The formula for calculating the Availability of a system is:
Availability = MTBF / (MTBF + MTTR)
Availability
Availability calculation example:
Total uptime = 8, Number of failures = 2 → MTBF = 8 / 2 = 4
Failure rate = 1 / MTBF = 0.25
Total downtime = 4, Number of failures = 2 → MTTR = 4 / 2 = 2
Availability = MTBF / (MTBF + MTTR) = 4 / (4 + 2) ≈ 0.67, i.e. about 67%
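The example above can be worked through in Python; the helper names are ours, chosen for readability:

```python
def mtbf(total_uptime, failures):
    """Mean Time Between Failures."""
    return total_uptime / failures

def mttr(total_downtime, failures):
    """Mean Time To Repair."""
    return total_downtime / failures

def availability(mtbf_value, mttr_value):
    # Fraction of time a repairable system is available for service.
    return mtbf_value / (mtbf_value + mttr_value)

m = mtbf(8, 2)   # 4.0
r = mttr(4, 2)   # 2.0
print(availability(m, r))  # ~0.667
print(1 / m)               # failure rate = 0.25
```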
Embedded Systems
An embedded system is a computer system designed
to perform one or a few dedicated functions, often
with real-time computing constraints.
Embedded systems control many of the common devices in use today, e.g. microwaves, washing machines, printers, network switches, cars, palmtops, cell phones, etc.
Some embedded computers are reprogrammable, but more commonly the program is loaded at manufacture and changed only on upgrade.
There is great diversity in embedded computers, from 8-bit microprocessors to 32-bit microprocessors capable of executing several million instructions per second, as used in network switches.
Embedded Systems
There are two important constraints in Embedded
computers:
Performance in real time
Minimum Cost
Performance Basics:
Execution Time vs Throughput
How do we measure which of two computers is faster?
Execution Time (response time, latency, …)
Time to complete one task
Throughput (bandwidth)
Tasks per unit of time (sec, hour, day, …)
How do we calculate that Machine X is n times faster than Machine Y?
Performance and Execution time are reciprocals of each other, so an increase in performance means a decrease in execution time:
Performance = 1 / Execution-Time
Execution Time
How does one define execution time?
sounds easy but...
The most common understanding is somewhat known as the
wall-clock time
Start a timer and launch a program at the same instant
When the program completes, stop the timer
The elapsed time is the wall-clock time
The wall-clock time includes all activities of the program
Another metric is CPU time
Wall-clock time NOT including time spent waiting for I/O, or time spent
waiting for the CPU to be free due to multiple processes running at the
same time
CPU time has two components:
User time: time spent in the program itself
System time: time spent in the operating systems
e.g., when calling system calls
Wall-clock time > User time + System time
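The distinction between wall-clock time and CPU time can be observed directly. A minimal Python sketch, using the standard timers as rough analogues of the measurements described above:

```python
import time

def busy_sum(n):
    # A purely compute-bound task: consumes CPU time.
    return sum(range(n))

wall_start = time.perf_counter()   # wall-clock (elapsed) timer
cpu_start = time.process_time()    # CPU time (user + system) of this process

busy_sum(1_000_000)
time.sleep(0.2)                    # waiting consumes wall-clock time, not CPU time

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start
# The sleep shows up in wall-clock time but (mostly) not in CPU time.
print(f"wall = {wall_elapsed:.3f}s, cpu = {cpu_elapsed:.3f}s")
```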
Execution Time Calculation: By Hand
One possibility would be to do this by just “looking”
at a clock, launching the program, “looking” at the
clock again when the program terminates
This of course has some drawbacks
Poor resolution
Requires the user’s attention
Only for wall-clock time
Therefore operating systems provide ways to time
programs automatically
UNIX/Linux provides the time command
The UNIX/Linux time Command
We can put time in front of any UNIX/Linux command we invoke
When the invoked command completes, time prints out timing (and
other) information, i.e.
% time ls /home/casanova/ -la -R
0.520u 1.570s 0:20.58 10.1% 0+0k 570+105io 0pf+0w
Linux time command example
[root@sajjad ~]# /usr/bin/time -v du /
Command being timed: "du /"
User time (seconds): 0.04
System time (seconds): 6.71
Percent of CPU this job got: 38%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:17.73
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 981
Voluntary context switches: 3869
Involuntary context switches: 53630
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Performance checking on a Dedicated
System
Measuring the performance of a code must be done on a
quiescent/unloaded machine
The machine only runs the standard OS processes
The machine must be dedicated
No other user can start a process
The user measuring the performance only runs the minimum
amount of processes
basically, a shell
Nevertheless, one should always present measurement results
as averages over several experiments
Because the (small) load imposed by the OS is not deterministic
Drawbacks of UNIX/Linux time
command
The time command has poor resolution
“Only” milliseconds
Sometimes one wants a higher precision, especially if
performance improvements are in the 1-2% range
time times the whole code
Sometimes one is only interested in timing some part of the
code
Sometimes one wants to compare the execution time of
different sections of the code
Calculating Execution Time of a code
in Java
ntp_gettime() (Internet RFC 1589)
Sort of like gettimeofday, but reports estimated error on time measurement
Not available for all systems
Part of the GNU C Library
Java: System.currentTimeMillis()
Known to have resolution problems — its actual resolution can be much coarser than 1 millisecond!
Solution: use a native interface to a better timer
Java: System.nanoTime()
Added in J2SE 5.0
Probably not accurate at the nanosecond level
Tons of “high precision timing in Java” on the Web
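For comparison, the same idea of timing one code section with a high-resolution monotonic timer can be sketched in Python, whose `time.perf_counter_ns()` plays roughly the role of Java's `System.nanoTime()` (the `time_section` helper is ours):

```python
import time

def time_section(fn, *args):
    """Time a single code section with a high-resolution monotonic timer."""
    start = time.perf_counter_ns()   # rough analogue of System.nanoTime()
    result = fn(*args)
    elapsed_ns = time.perf_counter_ns() - start
    return result, elapsed_ns

result, ns = time_section(sum, range(100_000))
print(f"sum took {ns} ns")
```

As with `nanoTime()`, nanosecond units do not imply nanosecond accuracy; the underlying clock resolution is platform-dependent.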
Levels and Types of Benchmarks
A benchmark is a test that measures the performance of hardware, software, or a complete computer system.
These tests can be used to compare how well a product performs against other products.
Based on the levels of performance they measure,
benchmarks can be grouped into two levels:
Component-Level Benchmarks
System-Level Benchmarks
Component-Level Benchmarks
Component-Level benchmarks test a specific component of a
computer system, such as the video board, the audio card, or
the microprocessor.
They are useful for selecting a component of a computer
system, that corresponds to a particular function.
Instead of testing the performance of the system running real
applications, component-level benchmarks focus on the
performance of subsystems within a system.
These subsystems may include arithmetic integer unit,
arithmetic floating-point unit, memory system, disk
subsystem, etc.
Examples of component-level benchmarks include:
SPECweb96 - measures the web server performance
GPC - measures graphics performance for displaying 3-D images
System-Level Benchmarks
System-Level benchmarks evaluate the overall performance of
a computer running real programs or applications.
These benchmarks are useful when comparing systems of
different architectures.
They take each subsystem into account, and indicate the
effect of each subsystem on the overall performance.
Examples of system-level benchmarks include:
SYSmark/NT 4.0 - measures the performance of computers running
popular business applications under Windows NT 4.0
TPC-C - measures the performance of transaction processing systems
Synthetic Benchmarks
Synthetic benchmarks are component-level benchmarks, and
they evaluate a particular capability of a subsystem. For
example, a disk subsystem performance benchmark may
combine a series of basic seek, read, and write operations
involving varying numbers of disk blocks of varying sizes.
When evaluating the results from synthetic benchmarks, the
following rules should be followed:
Understand the composition of the benchmark, and
Understand the factors contributing to the results.
Examples of synthetic benchmarks include:
WinBench 97 - measures the performance of a PC's graphics, disk,
processor, video and CD-ROM subsystems in Windows environment
MacBench 97 - measures the processor, floating-point, graphics, video,
and CD-ROM performance of a MAC OS system
Application Benchmarks
Application benchmarks employ actual application programs.
Most application benchmarks are system-level benchmarks,
and they measure the overall performance of a system.
When an application benchmark is run, it tests the
contribution of each component of the system to the overall
performance.
These benchmarks are usually larger and harder to set up and execute, and they are less useful for predicting future needs.
Examples of application benchmarks include:
Winstone 97 - tests a PC's overall performance when running Windows-
based 32-bit business applications
SYSmark/NT 4.0 - measures the performance of computers running
popular business applications under Windows NT 4.0
Benchmark Suites
Typically, many benchmarks are packaged together in “suites”
Most famous/successful benchmarks results are from SPEC
(Standard Performance Evaluation Corporation)
Cover different types of applications
www.spec.org
There are other benchmarks
Business Winstone
CC Winstone
Winbench
SPEC Benchmark - Report
Source:
http://www.spec.org/cloud_iaas2018/results/res2019q4/cloudiaas2018-20191113-00005.html
Interpreting Benchmark Results
Let’s say your boss asks you to buy a server, for hosting a Web server and a database server, for 8K.
You must find the server with the best performance for that price.
Now you know that you should look for
SPEC (Standard Performance Evaluation Corporation, www.spec.org) and
TPC (Transaction processing Performance Council www.tpc.org )
results in magazine or in documentation provided by the vendors
Question: how to interpret published benchmark results?
A few general principles
Pay attention to the software configuration
And in particular to OS
many benchmarks run in single-user mode or with “enhanced” OS configurations
Pay extreme attention to the compiler technology
Obtaining Benchmark Results
No matter what, people running benchmarks can always
Play unforeseen tricks
Fail to report on the full system information
For this reason, benchmarks come with different requirements
No source code modification allowed (SPEC)
just pay attention to the software configuration and the compiler
Source code modifications allowed, but are difficult (TPC-C)
use of proprietary software, OS difficult to modify
Source code modifications allowed
supercomputer benchmarks
EEMBC (so-called “optimized results”)
Hand-coding allowed (EEMBC)
allows assembly code to be written (can speed up code by up to 80x)
Summarizing Benchmark Results
Benchmark results are often misleading
Incomplete information on how the results were obtained
Underhanded tactics by companies and designers
To make things worse, benchmark suites contain
multiple programs, so what is reported is the
performance of a system on a variety of programs
Question: how do we summarize those results?
Measuring and Reporting Performance
For a
Desktop computer user
A computer is faster when a program runs in less time.
Large server manager
A computer is faster when more jobs are completed in less time.
– In this environment, for an individual user, lower response time is the criterion of performance.
Large data-processing environment/system
The overall system is judged by its throughput, i.e.
– The total amount of work done in a given time.
Measuring and Reporting Performance
So we can define these metrics in mathematical terminology as:
Performance = 1 / Execution-Time
As Execution time is the reciprocal of Performance, “Machine A is n times faster than Machine B” can be written as:
n = Performance of Machine A / Performance of Machine B
  = (1 / Execution time of A) / (1 / Execution time of B)
  = Execution time of B / Execution time of A
The CPU Performance Equation
Following is the CPU performance equation:
CPU Time = CPU clock cycles for a program × Clock cycle time
Expanding the clock cycles per program into instruction count and clock cycles per instruction:
CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
         = Seconds / Program
CPU Performance & Time
So it illustrates that the CPU performance and CPU
time is dependent on three characteristics:
clock cycle time (or rate)
clock Cycles Per Instruction (CPI)
Instruction Count (IC)
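The three characteristics combine multiplicatively, which a short Python sketch makes concrete (the function name and the example numbers are ours, for illustration):

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = IC x CPI / clock rate (equivalently IC x CPI x cycle time)."""
    return instruction_count * cpi / clock_rate_hz

# Hypothetical program: 10^9 instructions, average CPI of 2, 1 GHz clock.
print(cpu_time(1_000_000_000, 2.0, 1_000_000_000))  # 2.0 seconds
```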
CPU Performance & Time
n = Performance of Machine A / Performance of Machine B
  = (1 / Execution time of A) / (1 / Execution time of B)
  = Execution time of B / Execution time of A
For example, if a program takes 10 seconds on Machine A and 15 seconds on Machine B:
n = 15 / 10 = 1.5
i.e. Machine A is 1.5 times faster than Machine B.
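The ratio above is easily computed; a minimal sketch (function name is ours):

```python
def times_faster(exec_time_a, exec_time_b):
    """How many times faster machine A is than machine B.
    n = Perf(A) / Perf(B) = ExecTime(B) / ExecTime(A)."""
    return exec_time_b / exec_time_a

print(times_faster(10, 15))  # 1.5
```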
Quantitative Principles of Computer
Design
After defining, measuring and summarizing performance, we’ll
explore some principles that are useful in Design and Analysis of
Computers.
So, the answer to the question that “What are the main principles
people use for designing computers in order to gain high
performance” is to use the persistent rule in Computer Design, i.e.
“make the common case fast”,
i.e. favor the frequent case over the infrequent one.
When adding two numbers in CPU, we can expect overflow to be a
rare case and can therefore improve performance by optimizing the
more common case of no overflow.
Performance can decrease, in case of overflow, but it is rare, i.e.
not the common case.
So, the overall performance will be improved by optimizing for the normal case, i.e. the common case. Often the common case is also the simplest.
Amdahl’s Law
In order to apply make the common case fast, for
performance enhancement, we’ll have to decide:
What the frequent case is and
How much performance can be improved by making that case
faster.
A fundamental law, called Amdahl’s Law, can be used
to quantify this principle.
Amdahl’s Law
Amdahl’s Law defines the speedup that can be gained by
using a particular feature.
Speedup is the enhancement that we can make in a machine that
will improve performance when it is used.
Speedup = Performance for entire task using the enhancement when possible / Performance for entire task without using the enhancement
Alternatively,
Speedup = Execution time for entire task without using the enhancement / Execution time for entire task using the enhancement when possible
Speedup tells us how much faster a task will run using the
machine with the enhancement as opposed to the original
machine.
Amdahl’s Law gives us a quick way to find the speedup from
some enhancement, which depends on two factors:
The fraction of the computation time in the original machine that
can be converted to take advantage of the enhancement
The improvement gained by the enhanced execution mode
Amdahl’s Law
The fraction of the computation time in the original machine that can be converted to take advantage of the enhancement, i.e. if 20 sec. of the execution time of a program that takes 60 sec. in total can use an enhancement, the fraction is 20/60. This value, which we will call Fraction_enhanced, is always <= 1.
The improvement gained by the enhanced execution mode, i.e. how much faster the task would run if the enhanced mode were used for the entire program. This value is the time of the original mode over the time of the enhanced mode. If the enhanced mode takes 2 sec. for some portion of the program that can completely use the mode, while the original mode took 5 sec. for the same portion, then the improvement is 5/2. We'll call this value Speedup_enhanced, which is always > 1.
[Figure: the original execution time of the task vs. the execution time after a fraction F of it has been enhanced by a factor S.]
Amdahl’s Law
Execution time using the original machine with the enhanced mode will be
the time spent using the unenhanced portion of the machine plus the time
spent using the enhancement.
Execution-Time_new = Execution-Time_old × ((1 – Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Speedup_overall = Execution-Time_old / Execution-Time_new
                = 1 / ((1 – Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
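Amdahl's Law is a one-line function; this minimal Python sketch (function name is ours) is used against the worked examples that follow:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup from enhancing a fraction of the execution time."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# 40% of the time sped up 10x (as in Example #01):
print(round(amdahl_speedup(0.4, 10), 2))  # 1.56
```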
Example # 01
Suppose we are considering an enhancement to the processor
of a server system used for web serving. The new CPU is 10
times faster on computation in the web serving applications
than the original processor. Assuming that the original CPU is
busy with computations 40% of the time and is waiting for
I/O 60% of the time. What is the overall speedup gained by
incorporating the enhancement?
Answer
Fractionenhanced = 0.4
Speedupenhanced = 10
Speedup_overall = 1 / ((1 – 0.4) + 0.4/10) = 1 / (0.6 + 0.04) = 1 / 0.64 = 1.56
Example # 02
A common transformation required in graphics engines is square root.
Implementations of floating-point square root vary significantly in performance,
especially among processors designed for graphics. Suppose FP square root
(FPSQR) is responsible for 20% of the execution time of a critical graphics
benchmark. One proposal is to enhance the FPSQR hardware and speed up this
operation by a factor of 10. The other alternative is just to try to make all FP
instructions in the graphics processor run faster by a factor of 1.6; FP instructions
are responsible for a total of 50% of the execution time for the application. The
design team believes that they can make all FP instructions run 1.6 times faster
with the same effort required for the fast square root. Compare these two design
alternatives.
These two alternatives can be compared using the Speedup comparisons:
Speedup_FPSQR = 1 / ((1 – 0.2) + 0.2/10) = 1 / (0.8 + 0.02) = 1 / 0.82 = 1.22
Speedup_FP = 1 / ((1 – 0.5) + 0.5/1.6) = 1 / (0.5 + 0.3125) = 1 / 0.8125 = 1.23
Improving all FP instructions is slightly better overall.
Example # 03
Suppose we can improve Load/Store instructions by 5x, i.e. 5
times speedup, and we further suppose that 10% of the
instructions are Load/Store?
What is the Overall Speedup obtained in this scenario?
Answer
Fractionenhanced = 0.1
Speedupenhanced = 5
Speedup_overall = 1 / ((1 – 0.1) + 0.1/5) = 1 / (0.9 + 0.02) = 1 / 0.92 = 1.086
The greater the fraction enhanced, the greater the effect on overall performance; this is the essence of Amdahl's Law.
Example # 04
Suppose we can improve Load/Store instructions by 5x, i.e. 5
times speedup, and we further suppose that 50% of the
instructions are Load/Store?
What is the Overall Speedup obtained in this scenario?
Answer
Fractionenhanced = 0.5
Speedupenhanced = 5
Speedup_overall = 1 / ((1 – 0.5) + 0.5/5) = 1 / (0.5 + 0.1) = 1 / 0.6 = 1.67
Example # 05
Suppose we can improve Load/Store instructions by 20x, i.e.
20 times speedup, and we further suppose that 10% of the
instructions are Load/Store?
What is the Overall Speedup obtained in this scenario?
Answer
Fractionenhanced = 0.1
Speedupenhanced = 20
Speedup_overall = 1 / ((1 – 0.1) + 0.1/20) = 1 / (0.9 + 0.005) = 1 / 0.905 ≈ 1.10
Example # 06
Calculate overall speedup if we make 90% of a program run
10 times faster?
Answer
Fractionenhanced = 0.9
Speedupenhanced = 10
Speedup_overall = 1 / ((1 – 0.9) + 0.9/10) = 1 / (0.1 + 0.09) = 1 / 0.19 = 5.26
Example # 07
Calculate overall speedup if we make 80% of a program run
20 times faster?
Answer
Fractionenhanced = 0.8
Speedupenhanced = 20
Speedup_overall = 1 / ((1 – 0.8) + 0.8/20) = 1 / (0.2 + 0.04) = 1 / 0.24 ≈ 4.17
Example # 08
What fraction of the computations should be able to use the floating-point processor (whose speedup is 15) in order to achieve an overall speedup of 2.25?
Answer
Fraction_enhanced = ?
Speedup_enhanced = 15
2.25 = 1 / ((1 – F) + F/15) = 15 / (15 – 15F + F) = 15 / (15 – 14F)
2.25 (15 – 14F) = 15
33.75 – 31.5F = 15
31.5F = 18.75
F = 18.75 / 31.5 = 0.595, i.e. about 60%
Fraction_enhanced = 60%
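Solving Amdahl's Law for the fraction, as in this example, gives a closed form: F = (1 − 1/S_overall) / (1 − 1/S_enhanced). A minimal sketch (function name is ours):

```python
def required_fraction(overall_speedup, speedup_enhanced):
    """Fraction that must be enhanceable to reach a target overall speedup.
    Solves S_overall = 1 / ((1 - F) + F / S_enhanced) for F."""
    return (1 - 1 / overall_speedup) / (1 - 1 / speedup_enhanced)

print(round(required_fraction(2.25, 15), 3))  # 0.595
```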
Example # 09
We have a system that contains a special processor for doing
floating-point operations. We have determined that 50% of
our computations can use the floating-point processor. The
speedup of the floating pointing-point processor is 15.
What is the overall speedup achieved by using the floating-point processor?
Answer
Fractionenhanced = 0.5 AND Speedupenhanced = 15
Speedup_overall = 1 / ((1 – 0.5) + 0.5/15) = 1 / (0.5 + 0.033) = 1 / 0.533 = 1.875
What is the overall speedup achieved if we modify the compiler so that 75% of the
computations can use the floating-point processor?
Answer
Fractionenhanced = 0.75 AND Speedupenhanced = 15
Speedup_overall = 1 / ((1 – 0.75) + 0.75/15) = 1 / (0.25 + 0.05) = 1 / 0.3 = 3.33
Optimization (Improving Performance)
Once a machine's hardware is fixed, further optimization mostly depends on the software side, i.e. on how well programs exploit the hardware, and this is what ultimately determines performance.
Some ways of optimization are:
Locality
Temporal Locality
Spatial Locality
Parallelism
Parallelism at System Level
Parallelism at the level of an Individual Processor (Pipelining)
Properties to Exploit in Computer
Design: Locality
Although Amdahl's Law applies to any system, other important fundamental observations come from properties of programs, and the most important program property that we regularly exploit is the Principle of Locality.
Programs tend to reuse data and instructions they have used recently. A widely held rule of thumb is that a program spends 90% of its execution time in only 10% of the code.
An implication of locality is that we can predict with reasonable accuracy which instructions and data a program will use in the near future, based on its accesses in the recent past.
Temporal Locality states that recently accessed items are likely to be accessed again in the near future.
Spatial Locality states that items whose addresses are near one another tend to be referenced close together in time.
Principles of Locality
Locality is important because it allows us to predict
what the processor will do next with reasonable
accuracy based on what it did before
Temporal locality: having accessed a location, this location is likely to be accessed again
Therefore, if one can keep recently accessed data items
“close” to the processor, then perhaps the next instructions
will find them ready for use.
Spatial locality: having accessed a location, a nearby
location is likely to be accessed next
Therefore, if one can bring in contiguous data items “close”
to the processor at once, then perhaps a sequence of
instructions will find them ready for use.
Locality: Example
Consider the following sequence of addressed
references by the CPU
1,2,1200,1,1200,3,4,5,6,1200,7,8,9,10
This sequence has “good Temporal Locality”, as location 1200 is referenced three times (and location 1 twice) out of 14 references
This sequence has “good Spatial Locality”, as locations [1,2], [3,4,5,6] and [7,8,9,10] are addressed in sequence
Like the elements of an array that are accessed one after the
other
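Both kinds of locality in this trace can be counted mechanically; a minimal Python sketch, with helper names of our own choosing (here "spatial" is approximated as consecutive references to adjacent addresses):

```python
def temporal_reuse(trace):
    """Count references to addresses already seen earlier (temporal locality)."""
    seen, reuses = set(), 0
    for addr in trace:
        if addr in seen:
            reuses += 1
        seen.add(addr)
    return reuses

def sequential_pairs(trace):
    """Count consecutive references to adjacent addresses (spatial locality)."""
    return sum(1 for a, b in zip(trace, trace[1:]) if b == a + 1)

trace = [1, 2, 1200, 1, 1200, 3, 4, 5, 6, 1200, 7, 8, 9, 10]
print(temporal_reuse(trace))    # 3 repeat references (1 once, 1200 twice)
print(sequential_pairs(trace))  # 7 adjacent-address pairs
```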
Properties to Exploit in Computer
Design: Parallelism
One of the most important methods of improving performance is to take advantage of parallelism.
Parallelism refers to an act of performing two or
more tasks or operations simultaneously.
Parallelism operates at two levels:
Hardware
Software
Types of Parallelism
Parallelism is broadly classified into two types:
Parallelism in Hardware
Parallelism in Uniprocessor environments
– Pipelining
– Superscalar, VLIW etc
Vector Processors, GPUs and SIMD Instructions etc
Parallelism in Multiprocessor environments
– Symmetric shared-memory multiprocessors
– Distributed-memory multiprocessors
– Chip multiprocessors, e.g. multicore architectures
Parallelism in Software
Instruction Level Parallelism (ILP)
Task Level Parallelism (TLP)
Data Parallelism
Transaction Level Parallelism
Pipelining
In Pipelining, the Processor is divided into functional units.
Division of a processor into 5 functional units is given below:
Instruction Fetch IF
Instruction Decode ID
Instruction Execute IE
Memory Access MA
Write Back WB
Write-back
– Memory is updated only when a modified block is replaced.
Write-through
– Memory is updated immediately on every write.
(Note: write-back and write-through are cache update policies; the WB pipeline stage above writes an instruction's result back to the register file.)
Without Pipelining
[Animation frames omitted. Each instruction passes through IF, ID, IE, MA and WB one stage at a time, and the next instruction cannot enter IF until the previous instruction has completed WB. The 1st instruction completes in the 5th cycle, so each instruction takes 5 cycles and 5 instructions take 5 × 5 = 25 cycles.]
Pipelining
With pipelining, a new instruction enters the pipeline every cycle,
so the five stages work on up to five different instructions at once:

Cycle:    1   2   3   4   5   6   7   8   9
Instr 1:  IF  ID  IE  MA  WB
Instr 2:      IF  ID  IE  MA  WB
Instr 3:          IF  ID  IE  MA  WB
Instr 4:              IF  ID  IE  MA  WB
Instr 5:                  IF  ID  IE  MA  WB

The 1st instruction still completes in the 5th cycle, but with
pipelining it took only 9 cycles to complete 5 instructions.
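The cycle counts in the two diagrams follow a simple pattern, sketched below for an ideal k-stage pipeline with one instruction issued per cycle and no stalls (hazards and stalls, which lengthen real pipelines, are ignored here).

```python
# Ideal cycle counts for n instructions on a k-stage processor.
def cycles_without_pipelining(n, k=5):
    return n * k                      # each instruction runs all k stages alone

def cycles_with_pipelining(n, k=5):
    return k + (n - 1)                # k cycles to fill, then one finishes per cycle

print(cycles_without_pipelining(5))   # → 25
print(cycles_with_pipelining(5))      # → 9
```

For large n the speedup approaches k, which is why deeper pipelines were long a favored way to raise clock-rate-normalized throughput.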
Pipelining
The concept of pipelining is used in a uniprocessor
environment.
Pipelining makes computing and processing logic
considerably more complex.
Pipelining is not used in a multiprocessing environment,
as duplicated hardware is already available there, i.e. a
larger number of processors.
120
Computer Design Cycle
Computer design and development have been under the influence of:
Technology
Performance
Cost
New trends in technology have been used to reduce cost and
enhance performance.
The decisive factors behind the rapid changes in computer development
have been performance enhancement, price reduction and functional
improvement.
121
Computer Design Cycle: Performance
A computer design is evaluated for bottlenecks
using certain benchmarks in order to achieve optimum
performance.
[Diagram: existing systems are evaluated for bottlenecks using
benchmarks; the results feed into performance, technology and
cost decisions.]
122
Computer Design Cycle: Performance
Performance metrics are:
Time/Latency
Time is the key measure of performance:
A desktop user may define the performance of his/her machine in
terms of time taken by the machine to execute a program; whereas
A computer center manager running a large server system may
define the performance in terms of the number of jobs completed
in a specified time.
Throughput
The desktop user is interested in reducing the response time or
execution time, i.e. time between the start and completion of an
event; while
A data processing center is interested in increasing the throughput,
i.e. the number of jobs completed per unit time
Other measures, such as MIPS, MFLOPS, clock
frequency (MHz) or cache size, are not meaningful measures of
performance on their own.
123
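The latency/throughput distinction above can be made concrete with a worked example. The job count and elapsed time are illustrative assumptions; note that latency is the inverse of throughput only in the fully sequential case sketched here.

```python
# Latency vs. throughput for a sequential batch of jobs.
jobs = 120                 # jobs completed
elapsed_seconds = 60.0     # wall-clock time for the whole batch

throughput = jobs / elapsed_seconds   # jobs per second (data-center view)
avg_latency = elapsed_seconds / jobs  # seconds per job (desktop-user view)

print(f"throughput  = {throughput} jobs/s")   # → 2.0 jobs/s
print(f"avg latency = {avg_latency} s/job")   # → 0.5 s/job
```

When jobs overlap (as on a pipelined or parallel system), throughput can rise without any single job's latency improving, which is exactly why the two user groups in the slide measure performance differently.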
Performance Measuring Tools
Benchmarks:
Benchmarks are programs that try to capture the average frequency
of operations and operands of a large set of real programs.
Standardized benchmark suites are available from the Standard
Performance Evaluation Corporation – SPEC at www.spec.org
Hardware:
Cost, Delay, Area, Power consumption
Simulation at different levels of design abstraction, i.e.
ISA, RT, Gate, Circuit
Queuing Theory
To calculate the response time and throughput of the entire I/O system
Rules of Thumb
Fundamental “Laws”/Principles
124
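The queuing-theory use mentioned above can be sketched with Little's Law, N = X · R (average jobs in the system equals throughput times response time), a standard result for reasoning about I/O systems. The numeric values below are illustrative assumptions.

```python
# Little's Law applied to an I/O subsystem: N = X * R.
arrival_rate = 40.0        # requests per second entering the system (X)
response_time = 0.25       # average seconds per request (R)

in_system = arrival_rate * response_time   # average requests in flight (N)
print(in_system)                           # → 10.0
```

Read the other way around, measuring the average number of requests in flight and the arrival rate lets a designer infer the response time without timing individual requests.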
Computer Design Cycle: Technology
Technology trends motivate new designs.
The new designs are simulated to evaluate their performance under
different levels of workload.
Simulation helps keep the cost of verifying results to a minimum.
[Diagram: technology trends feed the simulation of new designs and
organizations against different workloads; existing systems are
evaluated for bottlenecks using benchmarks.]
128
Technology Trends: Processor Performance
In 1980, the performance of supercomputers was 50 times that of
microprocessors, 20 times that of minicomputers and 10 times that
of mainframe computers.
[Diagram: performance, implementation complexity and cost determine
the technology used to implement the next-generation system.]
130
Cost, Price and their Trends
Although there are some computer designs, such as
supercomputers, where cost matters little, cost-sensitive
designs are of growing significance.
For the last 15-20 years, the use of technology
improvements to achieve lower cost, as well as
increased performance, has been a major theme in
the computer industry.
We know that
Price is what we sell a finished good for, and
Cost is the amount spent to produce it, including overhead.
Understanding of cost and its factors is essential for
designers to be able to make intelligent decisions
about whether or not a new feature should be
included in designs where cost is an issue.
(Imagine an architect designing a skyscraper without any information on the costs of steel, concrete, etc.) 131
Price versus Cost
The relationship between cost and price is complex.
Cost refers to
the total amount spent to produce a product.
132
Price versus Cost
Manufacturing Cost is
the total amount spent to produce a component.
List Price is the amount for which the finished good is sold; it includes an
average discount of 15% to 35% in the form of volume discounts and/or retailer markup. 133
Chip Manufacturing Process & Stages
Silicon and the silicon chip made it possible to build LSI, VLSI and ULSI chips.
Silicon itself is obtained from sand, and is initially formed into a cylindrical ingot.
Silicon ingot → Slicer → Blank wafers → 20 to 40
processing steps → Tester → Part → Ship to customers
Cost of an IC = (die cost + die testing cost + packaging cost + final testing cost) / final test yield
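The IC cost formula can be turned into a small calculation. All of the input values below are illustrative assumptions, not real process data.

```python
# IC cost: sum the per-part costs, then divide by the final test yield,
# since parts that fail final test must be paid for by the good ones.
def ic_cost(die_cost, die_test_cost, packaging_cost,
            final_test_cost, final_test_yield):
    return (die_cost + die_test_cost + packaging_cost
            + final_test_cost) / final_test_yield

# E.g. a $10 die, $1 die test, $2 package, $1 final test, 50% final yield:
print(ic_cost(10.0, 1.0, 2.0, 1.0, 0.5))   # → 28.0
```

Note how sensitive the result is to yield: halving the final test yield doubles the cost of every good part shipped.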