Von Neumann Architecture
 Programs are stored on storage devices
 Programs are copied into memory for execution
 CPU reads each instruction in the program and executes accordingly
Von Neumann/Turing
Stored Program Computer
 ALU capable of operating on binary data
 Both ALU & CU contain registers.
Princeton Institute for Advanced Study (IAS)
 First implementation of von Neumann
stored program computer – the IAS
computer
 Began in 1946
 Completed in 1952
Structure of IAS machine
IAS Memory
 1000 x 40-bit words, each holding either a number or an instruction
 Numbers: signed-magnitude binary
  1 sign bit
  39 bits for magnitude
 Instructions: 2 x 20-bit instructions per word
  Left and right instructions (left executed first)
  8-bit opcode
  12-bit address
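As a small illustration of this word layout (a Python sketch, not part of the original IAS material), a 40-bit word can be split into its left and right 20-bit instructions, and each instruction into its 8-bit opcode and 12-bit address:

def decode_ias_word(word):
    """Split a 40-bit IAS instruction word into two 20-bit instructions.

    Each 20-bit instruction holds an 8-bit opcode followed by a
    12-bit memory address; the left instruction is executed first.
    """
    assert 0 <= word < 2**40, "IAS words are 40 bits wide"
    left = (word >> 20) & 0xFFFFF        # upper 20 bits
    right = word & 0xFFFFF               # lower 20 bits

    def split(instr):
        opcode = (instr >> 12) & 0xFF    # 8-bit opcode
        address = instr & 0xFFF          # 12-bit address
        return opcode, address

    return split(left), split(right)

# Example word: left instruction opcode 0x01 / address 0x005,
# right instruction opcode 0x02 / address 0x00A
word = (0x01 << 32) | (0x005 << 20) | (0x02 << 12) | 0x00A
print(decode_ias_word(word))             # ((1, 5), (2, 10))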
IAS Registers
 Set of registers (storage in CPU)
 Memory Buffer Register (MBR)
 Memory Address Register (MAR)
 Instruction Register (IR)
 Instruction Buffer Register (IBR)
 Program Counter (PC)
 Accumulator (AC)
 Multiplier Quotient (MQ)
IAS Registers
 Memory buffer register (MBR): Contains a word
to be stored in memory or sent to the I/O unit, or is
used to receive a word from memory or from the I/O
unit.
 Memory address register (MAR): Specifies the
address in memory of the word to be written from or
read into the MBR.
 Instruction register (IR): Contains the 8-bit
opcode of the instruction being executed.
IAS Registers
 Instruction buffer register (IBR): Employed to hold
temporarily the right-hand instruction from a word in
memory.
 Program counter (PC): Contains the address of the
next instruction-pair to be fetched from memory.
 Accumulator (AC) and multiplier quotient (MQ):
Employed to hold temporarily operands and results of
ALU operations. For example, the result of multiplying two
40-bit numbers is an 80-bit number; the most significant
40 bits are stored in the AC and the least significant in the
MQ.
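To make the AC/MQ split concrete, here is a minimal Python sketch (illustrative only; signs, which live in the sign bit under signed magnitude, are ignored) of how an 80-bit product is divided between the two 40-bit registers:

MASK_40 = (1 << 40) - 1     # a 40-bit register holds values 0 .. 2**40 - 1

def multiply_40bit(a, b):
    """Multiply two 40-bit magnitudes; return the (AC, MQ) halves of the product."""
    product = a * b                    # up to 80 bits wide
    ac = (product >> 40) & MASK_40     # most significant 40 bits -> AC
    mq = product & MASK_40             # least significant 40 bits -> MQ
    return ac, mq

ac, mq = multiply_40bit(2**39, 3)      # product is 3 * 2**39, which needs 41 bits
print(ac, mq)                          # 1 549755813888  (the low half is 2**39)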
Structure of IAS
 Figure 2.3, p. 22
Moore’s Law
 Gordon Moore - co-founder of Intel
 He observed (based on experience) that the number of transistors on a chip doubled every year
 Since the 1970s growth has slowed a little
  Number of transistors doubles every 18 months
 Cost of a chip has remained almost unchanged
 Higher packing density means shorter electrical paths,
giving higher performance
 Smaller size gives increased flexibility/portability
 Reduced power and cooling requirements
 Fewer system interconnections increases reliability
Growth in CPU Transistor Count
Effects of Moore’s Law
The doubling of the number of transistors on a
single chip every 18 months has had some effects on
the application of technology:
 Costs have fallen dramatically since chip prices have not
changed substantially since Moore made his prediction
 Tighter packaging has allowed for shorter electrical paths
and therefore faster execution
 Smaller packaging has allowed for more applications in
more environments
 Reduction in power and cooling requirements which also
helps with portability
 Putting more functions on a single chip means fewer off-chip
solder connections, which are less reliable than on-chip
connections, so overall reliability improves
Effects of Moore’s Law (continued)
 As technology allows for higher levels of performance, processor designers must come up with ways to use it:
 Keeping all parts of the processor busy
 Coordinating multiple pipelines
 Improved branch prediction
 Multiple processors
 Optimizing execution
 Real-time analysis of code to “re-order” execution
 Speculative execution of code
 Incorporating multiple functions on single chip
Performance Mismatch
 Experienced significant improvement
  Processor speed
  Memory capacity
 Experienced only minor improvement
  Memory speed
  Bus rates
  I/O device performance
Speeding it up
 Pipelining
 On board cache
 On board L1 & L2 cache
 Branch prediction
 Data flow analysis
 Speculative execution
Branch Prediction
 The processor looks ahead in the instruction code
fetched from memory and predicts which branches,
or groups of instructions, are likely to be processed
next. If the processor guesses right most of the
time, it can prefetch the correct instructions and
buffer them so that the processor is kept busy. The
more sophisticated examples of this strategy predict
not just the next branch but multiple branches
ahead. Thus, branch prediction increases the
amount of work available for the processor to
execute.
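As a toy illustration of the idea (not how any particular processor implements it), here is a 2-bit saturating-counter predictor in Python: each branch gets a small counter that is nudged toward "taken" or "not taken" by its actual outcomes, and the prediction is simply which side of the counter it currently sits on.

class TwoBitPredictor:
    """Toy 2-bit saturating-counter branch predictor (illustrative only).

    Counter states: 0,1 -> predict not taken; 2,3 -> predict taken.
    """
    def __init__(self):
        self.counters = {}             # branch address -> counter state

    def predict(self, branch_addr):
        return self.counters.get(branch_addr, 1) >= 2   # True = taken

    def update(self, branch_addr, taken):
        c = self.counters.get(branch_addr, 1)
        c = min(c + 1, 3) if taken else max(c - 1, 0)
        self.counters[branch_addr] = c

# A loop branch that is taken 9 times and then falls through:
p = TwoBitPredictor()
outcomes = [True] * 9 + [False]
hits = 0
for taken in outcomes:
    hits += (p.predict(0x400) == taken)
    p.update(0x400, taken)
print(f"correct predictions: {hits}/{len(outcomes)}")   # 8/10 here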
Data Flow Analysis
 The processor analyzes which instructions are
dependent on each other’s results, or data, to
create an optimized schedule of instructions. In
fact, instructions are scheduled to be executed
when ready, independent of the original program
order. This prevents unnecessary delay.
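A highly simplified sketch of the dependence idea (Python, illustrative only): each instruction lists the registers it reads and writes, and an instruction may issue as soon as every instruction that produces one of its inputs has already issued, regardless of program order.

# Each tuple: (name, registers read, registers written) - a made-up program
program = [
    ("i1", [],           ["r1"]),   # r1 = load A
    ("i2", ["r1"],       ["r2"]),   # r2 = r1 + 1   (depends on i1)
    ("i3", [],           ["r3"]),   # r3 = load B   (independent of i1, i2)
    ("i4", ["r2", "r3"], ["r4"]),   # r4 = r2 * r3  (depends on i2 and i3)
]

ready_regs, issued, cycle = set(), set(), 0
while len(issued) < len(program):
    # Issue every not-yet-issued instruction whose inputs are all ready.
    group = [n for n, reads, _ in program
             if n not in issued and all(r in ready_regs for r in reads)]
    for name, _, writes in program:
        if name in group:
            issued.add(name)
            ready_regs.update(writes)
    print(f"cycle {cycle}: issue {group}")
    cycle += 1
# cycle 0: issue ['i1', 'i3']   <- i3 runs ahead of program order
# cycle 1: issue ['i2']
# cycle 2: issue ['i4']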
Speculative Execution
 Using branch prediction and data flow analysis,
some processors speculatively execute
instructions ahead of their actual appearance in
the program execution, holding the results in
temporary locations. This enables the processor
to keep its execution engines as busy as possible
by executing instructions that are likely to be
needed.
Performance Balance (Mismatch?)
 Processor speed increased
 Memory capacity increased
 But not the speed
 Thus, memory speed lags behind processor speed
Logic and Memory Performance Gap
Solutions
 Increase number of bits retrieved at one time
 Change DRAM interface
  Cache
 Reduce frequency of memory access
  More complex cache and cache on chip
 Increase interconnection bandwidth
  High-speed buses
  Hierarchy of buses
I/O Devices
 Peripherals with intensive I/O demands
 Large data throughput demands
 Processors can handle this
 Problem moving data
 Solutions:
  Caching
  Buffering
  Higher-speed interconnection buses
  More elaborate bus structures
  Multiple-processor configurations
Key is Balance
 Processor components
 Main memory
 I/O devices
 Interconnection structures
Improvements in Chip Organization and Architecture
 Increase hardware speed of processor
  Fundamentally due to shrinking logic gate size
  More gates, packed more tightly, increasing clock rate
  Propagation time for signals reduced
 Increase size and speed of caches
  Dedicating part of processor chip
  Cache access times drop significantly
 Change processor organization and architecture
  Increase effective speed of execution
  Parallelism
Increased Cache Capacity
 Typically two or three levels of cache
between processor and main memory
 Chip density increased
 More cache memory on chip
 Faster cache access
 Pentium chip devoted about 10% of
chip area to cache
 Pentium 4 devotes about 50%
More Complex Execution Logic
 Enable parallel execution of instructions
 Pipeline works like assembly line
 Different stages of execution of different
instructions at same time along pipeline
 Superscalar allows multiple pipelines
within single processor
 Instructions that do not depend on one
another can be executed in parallel
Diminishing Returns
 Internal organization of processors complex
  Can get a great deal of parallelism
  Further significant increases likely to be relatively modest
 Benefits from cache are reaching limit
 Increasing clock rate runs into power dissipation problem
  Some fundamental physical limits are being reached
New Approach – Multiple Cores
 Multiple processors on single chip
 Large shared cache
 Within a processor, increase in performance
proportional to square root of increase in complexity
 If software can use multiple processors, doubling
number of processors almost doubles performance
 So, use two simpler processors on the chip rather than
one more complex processor
 With two processors, larger caches are justified
 Power consumption of memory logic less than processing logic
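The square-root relationship above can be turned into a quick back-of-the-envelope comparison (a Python sketch in arbitrary units, not data about any real chip): spend a doubled transistor budget either on one bigger core or on two copies of the original core.

import math

base_perf = 1.0          # performance of the original core (arbitrary units)
budget_factor = 2.0      # transistor/complexity budget is doubled

# Option 1: one core, twice as complex; performance grows ~ sqrt(complexity)
one_big_core = base_perf * math.sqrt(budget_factor)          # ~1.41x

# Option 2: two copies of the original core; performance ~ doubles,
# provided the software can actually use both cores
two_small_cores = 2 * base_perf                               # ~2.0x

print(f"one complex core : {one_big_core:.2f}x")
print(f"two simple cores : {two_small_cores:.2f}x (if the workload is parallel)")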
Performance Assessment
 Performance is one of the key
parameters to consider, along with
 cost,
 size,
 security,
 reliability, and,
 power consumption.
Performance Assessment
 Raw speed is far less important than how a
processor performs when executing a given
application.
 Application performance depends not just on the
raw speed of the processor, but on the
 instruction set, choice of implementation language,
efficiency of the compiler, and skill of the programmer
in implementing the application.
System Clock
Performance Assessment: Clock Speed
 Key parameters
  Performance, cost, size, security, reliability, power consumption
 System clock speed
  In Hz or multiples thereof (pulse frequency produced by the clock)
  Clock rate, clock cycle, clock tick, cycle time
 Signals in a CPU take time to settle down to 1 or 0
  Some signals may change at different speeds
  Computer operations need to be synchronised
 Instruction execution is done in discrete steps:
  Fetch, decode, load and store, arithmetic or logical
  Usually requires multiple clock cycles per instruction
  Pipelining gives simultaneous execution of instructions
 So, clock speed does not portray the complete picture for different processors
Performance Assessment: Clock Speed
 A 1-GHz processor receives 1 billion pulses per second
 Clock Rate/Clock Speed: the rate of pulses
 Cycle Time: the time duration between pulses
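For instance (a trivial Python check of the relationship, using the 1 GHz figure from the slide):

clock_rate = 1e9                 # 1 GHz = 10**9 pulses per second
cycle_time = 1 / clock_rate      # cycle time is the reciprocal of the clock rate
print(cycle_time)                # 1e-09 seconds, i.e. 1 nanosecond per cycle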
Instruction Execution Rate
 A processor is driven by a clock with a constant
frequency f, or
 1. a constant cycle time τ, where τ = 1/f
 2. Ic = Instruction Count is the number of
machine instructions executed for that program
until it runs to completion or for some defined
time interval (‘executed instructions’?)
 3. CPI = average cycles per instruction
 Is CPI a constant value for a processor?
 Why ‘average’?
Instruction Execution Rate
 On any given processor, the number of clock
cycles required varies for different types of
instructions, such as load, store, branch etc.
 Let CPIi be the number of cycles required for
instruction type i and Ii be the number of
executed instructions of type i for a given
program
 The overall CPI is then:
   CPI = Σi (CPIi * Ii) / Ic
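A small Python sketch of that weighted average (the instruction types and counts below are made up purely for illustration):

# (instruction type, CPI for that type, number executed) - hypothetical values
mix = [
    ("alu",    1, 60000),
    ("load",   2, 18000),
    ("branch", 4, 12000),
    ("store",  8, 10000),
]

Ic = sum(count for _, _, count in mix)                    # total executed instructions
total_cycles = sum(cpi * count for _, cpi, count in mix)  # sum of CPIi * Ii
CPI = total_cycles / Ic                                   # overall (average) CPI
print(f"Ic = {Ic}, overall CPI = {CPI:.2f}")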
Instruction Execution Rate
 The processor time T needed to execute a given
program can be expressed as: T = Ic * CPI * τ
 A refinement of this formula reflects the fact that
memory-related processing (memory references) takes
more time than processing done within the CPU
 Rewriting: T = Ic * [p + (m * k)] * τ
 Where: p = number of processor cycles needed to
decode and execute the instruction,
 m = number of memory references needed, and
 k = the ratio between memory cycle time and processor
cycle time.
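A quick numeric sketch of both formulas in Python (all parameter values are made up purely to show the arithmetic):

# Hypothetical parameters
Ic  = 5_000_000      # executed instructions
CPI = 2.0            # average cycles per instruction
f   = 1e9            # clock rate in Hz (1 GHz)
tau = 1 / f          # cycle time in seconds

T_simple = Ic * CPI * tau
print(f"T (simple)  = {T_simple*1e3:.2f} ms")    # Ic * CPI * tau -> 10.00 ms

# Refined form: per-instruction cycles split into processor work and memory work
p = 1     # processor cycles to decode and execute an instruction
m = 0.5   # average memory references per instruction
k = 4     # memory cycle time / processor cycle time
T_refined = Ic * (p + m * k) * tau
print(f"T (refined) = {T_refined*1e3:.2f} ms")   # Ic * [p + m*k] * tau -> 15.00 ms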
Performance Factors & System Attributes
 The five performance factors in the preceding
equation (Ic, p, m, k, τ) are influenced by four
system attributes:
 the design of the instruction set (known as
instruction set architecture),
 compiler technology (how effective the compiler is in
producing an efficient machine language program
from a high-level language program),
 processor implementation, and
 cache and memory hierarchy.
Performance Factors & System Attributes
MIPS
 Millions of instructions per second (MIPS)
 Millions of floating-point operations per
second (MFLOPS)
 Heavily dependent on instruction set,
compiler design, processor
implementation, cache & memory
hierarchy
MIPS rate
 MIPS rate in terms of the clock rate and CPI:
   MIPS rate = Ic / (T * 10^6) = f / (CPI * 10^6)
MIPS
 Consider the execution of a program which results in the
execution of 2 million instructions on a 400-MHz
processor. The program consists of four major types of
instructions. The instruction mix and the CPI for each
instruction type are given below based on the result of a
program trace experiment:
MIPS
 The average CPI when the program is
executed on a uniprocessor with the above
trace results is:

 The corresponding MIPS rate is:
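The instruction-mix table and the resulting numbers are not reproduced here, but the calculation pattern can be sketched in Python with made-up mix fractions (only the 2 million instructions and the 400-MHz clock are taken from the slide):

# Hypothetical instruction mix: (fraction of Ic, CPI for that type)
mix = [
    (0.60, 1),   # e.g. ALU operations
    (0.18, 2),   # e.g. loads/stores that hit in the cache
    (0.12, 4),   # e.g. branches
    (0.10, 8),   # e.g. memory references that miss the cache
]

f  = 400e6                                        # 400 MHz clock (from the slide)
Ic = 2_000_000                                    # 2 million instructions (from the slide)

avg_CPI   = sum(frac * cpi for frac, cpi in mix)  # weighted average CPI
T         = Ic * avg_CPI / f                      # execution time in seconds
mips_rate = f / (avg_CPI * 1e6)                   # MIPS rate = f / (CPI * 10^6)

print(f"average CPI = {avg_CPI:.2f}")             # 2.24 with this made-up mix
print(f"T = {T*1e3:.2f} ms, MIPS rate = {mips_rate:.0f}")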
