COMPUTER ORGANIZATION AND ARCHITECTURE
UNIT -I
By
Dr. G. Bhaskar
Associate Professor
Dept. of ECE
Email Id: bhaskar.0416@gmail.com
Outline:
COMPUTER ORGANIZATION AND ARCHITECTURE: SYLLABUS
UNIT - I
Digital Computers: Introduction, Block diagram of Digital Computer, Definition of
Computer Organization, Computer Design and Computer Architecture.
Register Transfer Language and Micro operations: Register Transfer language, Register
Transfer, Bus and memory transfers, Arithmetic Micro operations, logic micro operations,
shift micro operations, Arithmetic logic shift unit.
Basic Computer Organization and Design: Instruction codes, Computer Registers,
Computer instructions, Timing and Control, Instruction cycle, Memory Reference
Instructions, Input – Output and Interrupt.
UNIT - II
Microprogrammed Control: Control memory, Address sequencing, micro program
example, design of control unit.
Central Processing Unit: General Register Organization, Instruction Formats, Addressing
modes, Data Transfer and Manipulation, Program Control.
UNIT - III
Data Representation: Data types, Complements, Fixed Point Representation, Floating
Point Representation.
Computer Arithmetic: Addition and subtraction, multiplication Algorithms, Division
Algorithms, Floating – point Arithmetic operations. Decimal Arithmetic unit, Decimal
Arithmetic operations.
UNIT - IV
Input-Output Organization: Input-Output Interface, Asynchronous data transfer, Modes
of Transfer, Priority Interrupt Direct memory Access.
Memory Organization: Memory Hierarchy, Main Memory, Auxiliary memory,
Associative Memory, Cache Memory.
UNIT -V
Reduced Instruction Set Computer: CISC Characteristics, RISC Characteristics.
Pipeline and Vector Processing: Parallel Processing, Pipelining, Arithmetic Pipeline,
Instruction Pipeline, RISC Pipeline, Vector Processing, Array Processor.
Multi Processors: Characteristics of Multiprocessors, Interconnection Structures, Inter
processor arbitration, Inter processor communication and synchronization, Cache
Coherence.
TEXT BOOK:
1. Computer System Architecture – M. Morris Mano, Third Edition, Pearson/PHI.
REFERENCE BOOKS:
1. Computer Organization – Carl Hamacher, Zvonko Vranesic, Safwat Zaky, 5th Edition,
McGraw Hill.
2. Computer Organization and Architecture – William Stallings, Sixth Edition,
Pearson/PHI.
3. Structured Computer Organization – Andrew S. Tanenbaum, 4th Edition, PHI/Pearson.
Course Outcomes:
UNIT - I
Digital Computers: Introduction, Block diagram of Digital Computer, Definition of
Computer Organization, Computer Design and Computer Architecture.
Digital Computers: Introduction
The digital computer is a digital system that performs various computational
tasks.
The word digital implies that the information in the computer is represented by
variables that take a limited number of discrete values.
These values are processed internally by components that can maintain a
limited number of discrete states.
The decimal digits 0, 1, 2, . . . , 9, for example, provide 10 discrete values.
Digital computers use the binary number system, which has two digits: 0 and 1.
A binary digit is called a bit. Information is represented in digital computers in
groups of bits.
By using various coding techniques, groups of bits can be made to represent
not only binary numbers but also other discrete symbols, such as decimal digits
or letters of the alphabet.
Block diagram of Digital Computer:
FOURTH GENERATION (1975-1985)
• IC technology improved.
• Improved IC technology helped in designing low-cost, high-speed processor and
memory modules.
• Multiprogramming and pipelining concepts were incorporated.
• Operating systems allowed efficient and coordinated operation of computer
systems with multiple users.
• Cache and virtual memory concepts were developed.
• More than one circuit on a single silicon chip became available.
COMPUTER TYPES
SUPER COMPUTERS
[Block diagram: the five functional units of a computer (Input, Memory,
Arithmetic & Logic Unit, Output, Control). Instructions (Instr1, Instr2, Instr3)
and data (Data1, Data2) enter through the Input unit into Memory; the ALU
operates on the data and the Output unit delivers results, all under the
direction of the Control unit.]
Functional units:
• Input Unit
• Output Unit
• Memory
• Bus Structure

INPUT UNIT:
[Diagram: input devices such as a keyboard and audio input feed the Input unit,
which passes information to the Memory and the Processor.]
Output unit
•Computers represent information in a specific binary form. Output units:
- Interface with output devices.
- Accept processed results provided by the computer in specific binary form.
- Convert the information in binary form to a form understood by an
output device.
[Diagram: the Processor and Memory deliver results to the Output unit, which
drives output devices such as a printer, graphics display, and speakers.]
CPU
•The “brain” of the machine
Control unit
MEMORY
• Two types: RAM (read/write memory) and ROM (read-only memory).
• ROM is used to store data and programs that are not going to change.
Memory unit (contd..)
• Processor reads/writes to/from memory
based on the memory address:
– Access any word location in a short and fixed amount of time based on the
address.
– Random Access Memory (RAM) provides fixed access time independent of
the location of the word.
– Access time is known as “Memory Access Time”.
• Memory and processor have to
“communicate” with each other in order to
read/write information.
– In order to reduce “communication time”, a small amount of RAM (known as
Cache) is tightly coupled with the processor.
• Modern computers have three to four levels of RAM units with different speeds
and sizes:
– Fastest, smallest known as Cache
– Slowest, largest known as Main memory.
MEMORY LOCATIONS AND ADDRESSES
•Main memory is the second major subsystem in a
computer. It consists of a collection of storage locations,
each with a unique identifier, called an address.
[Diagram: a processor connected to an eight-location memory. Address lines
A0-A2 on the address bus select one of the locations 000 (0th) through 111
(7th); data lines D0-D7 on the data bus carry the stored word; control signals
CS, RD, and W/R govern the transfer.]
Cont:-
•Clock speed
Information in a computer -- Data
How are the functional units connected?
•For a computer to achieve its operation, the functional units need to
communicate with each other.
•In order to communicate, they need to be connected.
Bus
Organization of cache and main memory
[Diagram: the Main memory, Cache memory, and Processor are connected by the
Bus, with the cache sitting between the processor and main memory.]
Why is the access time of the cache memory lesser than the access time of the
main memory?
Computer Components: Top-Level View
Basic Operational Concepts
Computer Systems and Their Parts
[Taxonomy: Computer -> Analog or Digital; Digital -> Fixed-function or
Stored-program; Stored-program -> Electronic or Nonelectronic; Electronic ->
Special-purpose or General-purpose.]
REGISTER TRANSFER LANGUAGE
• The symbolic notation used to describe the micro-operation transfer
among registers is called RTL (Register Transfer Language).
● Eg:
○ CLA - 7800 : Clear AC
○ CLE - 7400 : Clear E
I/O instructions
● These instructions are needed for transferring
information to and from the AC register.
● Recognized by the opcode 111 and a 1 in the
15th bit.
● Bits 0-11 specify the type of I/O Operation
performed.
● Eg:
○ INP - F800 : Input characters to AC
○ OUT - F400 : Output characters from AC
Instruction Set Completeness
● The set of instructions is said to be
complete if the computer includes a sufficient
number of instructions in each of the
following categories:
a. Arithmetic, logical, and shift instructions
b. Instructions for moving information to and from
memory and processor registers
c. Program control instructions
d. Input and output instructions
Timing and Control
● The timing for all registers in the basic computer
is controlled by a master clock generator.
● The clock pulses are applied to all flip-flops and
registers.
● The clock pulses do not change the state of a
register unless the register is enabled by a
control signal.
● Two major types of control organization:
○ hardwired control
○ microprogrammed control.
Example: Add R1, R2
[Timing diagram: under successive clock pulses T1, T2, ... the control unit
enables R1 during T1 and R2 during T2, with the transfer completing in a
later cycle.]
• The control unit works with a reference signal called the processor clock.
• Each basic step is executed in one clock cycle.
Hardwired Control vs. Microprogrammed Control
● Microinstructions can be saved by employing subroutines.
F3 field:
001  AC ← AC ⊕ DR   XOR
010  AC ← AC'       COM
101  PC ← PC + 1    INCPC
110  PC ← AR        ARTPC
111  Reserved

CD field:
00  Always   U  Unconditional branch
01  DR(15)   I  Indirect address bit

BR field:
00  JMP   CAR ← AD if condition = 1; CAR ← CAR + 1 if condition = 0
01  CALL  CAR ← AD, SBR ← CAR + 1 if condition = 1; CAR ← CAR + 1 if condition = 0
● The register set stores data used during the execution of the
instructions.
● ALU performs the required microoperations for executing the
instructions.
● The control unit supervises the transfer of information among the
registers and instructs the ALU as to which operation to perform.
General Register Organization
● Registers are used to store the intermediate values
during instruction execution.
● Register organization shows how registers are selected
and how data flows between the registers and the ALU.
● A decoder is used to select a particular register.
● The output of each register is connected to two
multiplexers to form the two buses A and B.
● The selection lines in each multiplexer select the input
data for the particular bus.
● The A and B buses form the two inputs of an ALU.
● The operation select lines decide the micro operation
to be performed by ALU.
● The result of the micro operation is available at the output bus.
● The output bus is connected to the inputs of all registers; thus, by
selecting a destination register it is possible to store the result in it.
Example : To perform R1 ← R2 + R3,
1. MUX A selector (SELA): to place the content of R2 into bus A.
2. MUX B selector (SELB): to place the content of R3 into bus B.
3. ALU operation selector (OPR): to provide the arithmetic addition A + B.
4. Decoder destination selector (SELD): to transfer the content of the
output bus into R1.
Control Word
● The combined value of the binary selection inputs specifies the
control word.
● It consists of four fields: SELA, SELB, and SELD contain three bits each,
and the OPR field contains four bits; thus the control word is 13 bits
in total.
● The three bits of SELA select the source register for the A input of
the ALU.
● The three bits of SELB select the source register for the B input of
the ALU.
● The three bits of SELD select the destination register using the
decoder.
● The four bits of OPR select the operation to be performed by the ALU.
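As a sketch of how these fields combine, the 13-bit control word can be packed with shifts and masks. The register select codes and the ADD encoding below are illustrative assumptions, not values from the text.

```python
def control_word(sela, selb, seld, opr):
    """Pack a 13-bit control word: SELA(3) | SELB(3) | SELD(3) | OPR(4)."""
    assert sela < 8 and selb < 8 and seld < 8 and opr < 16
    return (sela << 10) | (selb << 7) | (seld << 4) | opr

# R1 <- R2 + R3: SELA picks R2, SELB picks R3, SELD picks R1.
R1, R2, R3 = 1, 2, 3      # hypothetical register select codes
ADD = 0b0010              # hypothetical OPR code for addition
cw = control_word(R2, R3, R1, ADD)
print(format(cw, "013b"))  # fields: 010 | 011 | 001 | 0010
```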
Example : R2=R1+R3
Instruction Formats
● The bits of the instruction are divided into groups called fields.
1. Opcode field - specifies the operation to be performed.
2. Address field - specifies a memory address / processor register.
3. Mode field - specifies the way the operand or the effective address is
determined.
● The number of address fields depends on the internal organization
of registers.
● Three types of CPU organizations:
1. Single accumulator organization.
Eg: ADD X
2. General register organization.
Eg: ADD R1, R2, R3 - MOV R1, R2
3. Stack organization
Eg: PUSH X
● Based on these, instructions are classified into four formats.
Three-Address Instruction
● Computers with three-address instruction formats can
use each address field to specify either a processor
register or a memory operand.
● The program in assembly language that evaluates X =
(A + B) * (C + D)
ADD R1, A, B // R1 ← M[A] + M[B]
ADD R2, C, D // R2 ← M[C] + M[D]
MUL X, R1, R2 // M[X] ← R1* R2
● Advantage - It results in short programs when
evaluating arithmetic expressions.
● Disadvantage - The binary-coded instructions require
too many bits to specify three addresses.
● Eg: Commercial computer Cyber 170.
Two-Address Instruction
● Two-address instructions are the most common
in commercial computers.
● Here again each address field can specify either
a processor register or a memory word.
● The program to evaluate X = (A + B) * (C + D)
MOV R1, A // R1 ← M[A]
ADD R1, B // R1 ← R1 + M[B]
MOV R2, C // R2 ← M[C]
ADD R2, D // R2 ← R2 + M[D]
MUL R1, R2 // R1 ← R1*R2
MOV X, R1 // M[X] ← R1
One-Address Instruction
● Use an implied accumulator (AC) register for all data
manipulation.
● Here we neglect the second register and assume that
the AC contains the result of all operations.
● The program to evaluate X = (A + B) * (C + D)
LOAD A // AC ← M[A]
ADD B // AC ← AC + M[B]
STORE T // M[T] ← AC
LOAD C // AC ← M[C]
ADD D // AC ← AC + M[D]
MUL T // AC ← AC * M[T]
STORE X // M[X] ← AC
● T is the address of a temporary memory location
required for storing the intermediate result.
Zero-Address Instruction
● Used in stack-organized computers.
● The program to evaluate X = (A + B) * (C + D) for a
stack-organized computer
PUSH A // TOS ← A
PUSH B // TOS ← B
ADD // TOS ← (A + B)
PUSH C // TOS ← C
PUSH D // TOS ← D
ADD // TOS ← (C + D)
MUL // TOS ← (C + D)*(A + B)
POP X // M[X] ← TOS
● To evaluate arithmetic expressions in a stack
computer, it is necessary to convert the expression
into reverse Polish notation.
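The stack-machine behavior above can be sketched with a small reverse-Polish evaluator; the operand values in the example are arbitrary.

```python
def eval_rpn(tokens, env):
    """Evaluate reverse Polish notation the way a zero-address machine does:
    operands are pushed, operators pop two values and push the result."""
    stack = []
    for t in tokens:
        if t == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif t == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            stack.append(env[t])      # PUSH of a named operand
    return stack.pop()                # POP the final result

# X = (A + B) * (C + D)  ->  RPN: A B ADD C D ADD MUL
env = {"A": 2, "B": 3, "C": 4, "D": 5}
x = eval_rpn(["A", "B", "ADD", "C", "D", "ADD", "MUL"], env)
print(x)  # (2 + 3) * (4 + 5) = 45
```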
Addressing Modes
● The operation field of an instruction specifies the operation to be
performed.
● This operation must be executed on some data stored in computer
registers or memory words.
● The way the operands are chosen during program execution is
dependent on the addressing mode of the instruction.
● The addressing mode specifies a rule for interpreting or modifying
the address field of the instruction before the operand is actually
referenced.
● Computers use addressing mode techniques for the purpose of
accommodating one or both of the following provisions:
1. To give programming versatility to the user by providing such
facilities as pointers to memory, counters for loop control, indexing
of data, and program relocation.
2. To reduce the number of bits in the addressing field of the
instruction
1. Implied Mode
● The operands are specified implicitly in the
definition of the instruction.
● Eg: CMA - Complement Accumulator.
● All register reference instructions that use an
accumulator are implied-mode instructions.
● Zero-address instructions in a stack-organized
computer are implied-mode instructions
since the operands are implied to be on top
of the stack.
2. Immediate Mode
● The operand is specified in the instruction
itself.
● It has an operand field rather than an address
field.
● The operand field contains the actual
operand to be used in conjunction with the
operation specified in the instruction.
● Immediate-mode instructions are useful for initializing registers to
a constant value.
● Eg: ADD 7
3. Register Mode
● The address field specifies a processor register.
● In this mode the operands are in registers that reside
within the CPU.
● The particular register is selected from a register field
in the instruction.
● A k-bit field can specify any one of 2^k registers.
● Eg: ADD R1
4. Register Indirect Mode
● The instruction specifies a register in the CPU whose contents give the
address of the operand in memory.
● The selected register contains the address of the operand rather than the
operand itself.
● Before using a register indirect mode instruction, the programmer must
ensure that the memory address of the operand is placed in the
processor register with a previous instruction.
● Advantage - The address field of the instruction uses fewer bits to select
a register.
5. Autoincrement or Autodecrement Mode
● Similar to the register indirect mode except
that the register is incremented or
decremented after (or before) its value is
used to access memory.
● When the address stored in the register refers
to a table of data in memory, it is necessary
to increment or decrement the register after
every access to the table.
● This can be achieved by using the increment
or decrement instruction.
6. Direct Address Mode
● The effective address is equal to the address part of the
instruction.
● The operand resides in memory and its address is given
directly by the address field of the instruction.
● In a branch-type instruction the address field specifies the
actual branch address.
7. Indirect Address Mode
● The address field of the instruction gives the address
where the effective address is stored in memory.
● Control fetches the instruction from memory and uses
its address part to access memory again to read the
effective address.
8. Relative Address Mode
● Content of the program counter is added to the address part of
the instruction in order to obtain the effective address.
● The address part of the instruction is usually a signed number
(positive or negative).
● This number is added to the content of the program counter,
producing an effective address whose position in memory is
relative to the address of the next instruction.
● It is often used with branch-type instructions.
● Example:
Let PC contains the number 825.
The address part of the instruction contains the number 24.
The instruction at location 825 is read from memory during the
fetch phase and the program counter is then incremented by one
to 826.
The effective address computation for the relative address mode is
826 + 24 = 850.
9. Indexed Addressing Mode
● Content of an index register is added to the address part
of the instruction to obtain the effective address.
● The index register is a special CPU register that contains an
index value.
● The address field of the instruction defines the beginning
address of a data array in memory.
● The distance between the beginning address and the
address of the operand is the index value stored in the
index register.
● Any operand in the array can be accessed with the same
instruction if the index register contains the correct index
value.
● The index register can be incremented to facilitate access
to consecutive operands.
10. Base Register Addressing Mode
● In this mode the content of a base register is added to the address
part of the instruction to obtain the effective address.
● Similar to the indexed addressing mode except that the register is
now called a base register.
● The difference between the two modes is in the way they are used
rather than in the way that they are computed.
● A base register holds a base address and the address field of the
instruction gives a displacement relative to this base address.
● This mode is used to facilitate the relocation of programs in
memory.
● When programs and data are moved from one segment of
memory to another, as required in multiprogramming systems, the
address values of instructions must reflect this change of position.
● Here only the value of the base register requires updating to
reflect the beginning of a new memory segment.
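Several of the modes above can be summarized in a minimal effective-address sketch. The memory contents and register values are made-up examples; the relative case reuses the PC = 826, address part = 24 example from the text.

```python
def effective_address(mode, addr_field, pc=0, index=0, base=0, memory=None):
    """Effective-address computation for a few addressing modes."""
    if mode == "direct":
        return addr_field                 # EA = address field
    if mode == "indirect":
        return memory[addr_field]         # memory holds the EA
    if mode == "relative":
        return pc + addr_field            # PC already incremented
    if mode == "indexed":
        return addr_field + index         # array start + index register
    if mode == "base":
        return base + addr_field          # base register + displacement
    raise ValueError("unknown mode: " + mode)

memory = {300: 1350}                      # hypothetical memory contents
print(effective_address("relative", 24, pc=826))          # 850, as in the text
print(effective_address("indirect", 300, memory=memory))  # 1350
```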
Data Transfer & Manipulation
Data manipulation instructions include:
1. Arithmetic instructions
2. Logical and bit manipulation instructions
3. Shift instructions

Shift instructions are of three types:
1. Logical Shift
2. Arithmetic Shift
3. Rotate Shift

Logical Shift
● Inserts 0 into the vacated end bit position.
● The end position is the leftmost bit for shift right and the rightmost bit for shift left.
Arithmetic Shift
● Arithmetic Shift Right - preserves the sign bit in the leftmost position. The remaining
bits are shifted right along with a copy of the sign bit, and the sign bit itself
remains unchanged.
Rotate Shift
● Bits shifted out of one end are not lost but are circulated back into the other end.
● Rotate Left through Carry (ROLC) and Rotate Right through Carry (RORC) treat the
carry bit as an extension of the register whose word is being rotated.
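The three kinds of shift can be sketched on an 8-bit register; the bit patterns below are arbitrary examples.

```python
MASK = 0xFF  # 8-bit register

def shl(x):                 # logical shift left: 0 enters bit 0
    return (x << 1) & MASK

def shr(x):                 # logical shift right: 0 enters bit 7
    return x >> 1

def ashr(x):                # arithmetic shift right: sign bit (bit 7) preserved
    return (x >> 1) | (x & 0x80)

def rolc(x, carry):         # rotate left through carry: carry extends the register
    return ((x << 1) | carry) & MASK, x >> 7   # (new value, new carry)

print(format(shl(0b10110011), "08b"))   # 01100110
print(format(ashr(0b10110011), "08b"))  # 11011001
```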
Program Control Instructions
● Program control instructions specify conditions for altering the
content of the program counter.
● Provides decision making capabilities.
● Branch (BR) : BR ADR, branch the program to Address ADR.
(PC←ADR)
● Jump (JMP)
● Skip (SKP) : Skip instruction(PC←PC + 1) if some condition is
met.
● Call (CALL) : Used with subroutines
● Return (RET)
● Compare (CMP) : Compare by subtraction.
● Test (by ANDing) TST : AND instruction without storing result.
END OF UNIT 2
22CS2112: COMPUTER ORGANIZATION
AND ARCHITECTURE
UNIT-III
By
Dr. G. Bhaskar
Associate Professor
Dept. of ECE
Email Id: bhaskar.0416@gmail.com
UNIT 3
• Data Representation: Data types, Complements,
• Fixed Point Representation,
• Floating Point Representation.
• Computer Arithmetic: Addition and subtraction,
• multiplication Algorithms,
• Division Algorithms,
• Floating – point Arithmetic operations.
• Decimal Arithmetic unit, Decimal Arithmetic operations.
Computer data types
• Computer programs or applications may use different types of data based on the
problem or requirement.
• Numeric data
• It can be of the following two types:
• Integers
• Real Numbers
• Real numbers can be represented as:
• Fixed point representation
• Floating point representation
• Character data
• A sequence of characters is called character data.
• A character may be alphabetic (A-Z or a-z), numeric (0-9), a
special character (+, #, *, @, etc.), or a combination of these.
A character is represented by a group of bits.
• When multiple characters are combined, they form meaningful
data. Characters are commonly represented in the standard ASCII
format; another popular format is EBCDIC, used in large computer
systems.
• Examples of character data:
• Rajneesh1#
• 229/3, xyZ
• Mission Milap – X/10
• Logical data
• Logical data is used by computer systems to take logical decisions.
• Logical data differs from numeric or alphanumeric data: numeric and
alphanumeric data are associated with numbers or characters, whereas
logical data is denoted by one of two values, true (T) or false (F).
• You can see examples of logical data in the construction of truth
tables for logic gates.
• Logical data can also be a statement consisting of numeric or
character data with relational symbols (>, <, =, etc.).
• Character set
• Character sets can be of the following types in computers:
• Alphabetic characters - the alphabet characters A-Z or a-z.
• Numeric characters - the digits 0 to 9.
• Special characters - symbols such as +, *, /, -, ., <, >, =, @, %, #,
etc.
Fixed point representation
• In computers, fixed-point representation is a real data
type for numbers. Data is converted into binary form, and then the
data is processed, stored, and used by the computer. A fixed-point
format has a fixed number of bits for the integral and fractional
parts. For example, with the fixed-point format
IIIII.FFF, we can store a minimum value of 00000.001
and a maximum value of 99999.999.
• A fixed-point number has three parts: the sign bit, the
integral part, and the fractional part.
• Sign bit:- The fixed-point number representation
in binary uses a sign bit. A negative number has
a sign bit of 1, while a positive number has a sign bit of 0.
• Integral Part:- The integral part in fixed-point
numbers is of different lengths at different places.
It depends on the register's size; for an 8-bit
register, the integral part is 4 bits.
• Fractional part:- The Fractional part is of different
lengths at different places. It depends on the
registers; for an 8-bit register, the fractional part
is 3 bits.
How to write numbers in Fixed-point
notation?
• Now that we have learned about fixed-point number
representation, let's see how to represent it.
• The number considered is 4.5
• Step 1: We will convert the number 4.5 to binary
form. 4.5 = 100.1
• Step 2: Represent the binary number in fixed-point
notation with the following format.
• Fixed Point Notation of 4.5
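The two steps above can be sketched in Python; the 1 + 4 + 3 bit split (sign, integral, fractional) follows the 8-bit register example given earlier.

```python
def to_fixed(value, int_bits=4, frac_bits=3):
    """Encode a real number as sign | integral | fractional bits."""
    sign = 0 if value >= 0 else 1
    mag = round(abs(value) * (1 << frac_bits))    # scale by 2^frac_bits
    assert mag < (1 << (int_bits + frac_bits)), "value out of range"
    return (sign << (int_bits + frac_bits)) | mag

# 4.5 -> binary 100.1 -> sign 0, integral 0100, fraction 100
print(format(to_fixed(4.5), "08b"))  # 00100100
```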
Floating Point Representations
Floating-point arithmetic
mantissa × 2^exponent

s | e | f

1001 (binary) = 1.001 × 2^3

❑ What's the normalized representation of 00101101.101?
00101101.101 = 1.01101101101 × 2^5
❑ The field f contains a binary fraction.
❑ The actual mantissa of the floating-point value is (1 + f).
– In other words, there is an implicit 1 to the left of the binary
point.
– For example, if f is 01101…, the mantissa would be 1.01101…
❑ A side effect is that we get a little more precision: there are 24 bits in
the mantissa, but we only need to store 23 of them.
❑ But, what about value 0?
Exponent
s e f
❑ There are special cases that require encodings
– Infinities (overflow)
– NaN (invalid operations such as 0/0)
❑ For example:
– Single-precision: 8 bits in e → 256 codes; 11111111 reserved for
special cases → 255 codes; one code (00000000) for zero → 254
codes; need both positive and negative exponents → half
positives (127), and half negatives (127)
– Double-precision: 11 bits in e → 2048 codes; 111…1 reserved for
special cases → 2047 codes; one code for zero → 2046 codes;
need both positive and negative exponents → half positives
(1023), and half negatives (1023)
Exponent
s | e | f
value = (1 - 2s) × (1 + f) × 2^(e - bias)

Example: 1 01111100 11000000000000000000000
(1 - 2s) × (1 + f) × 2^(e - bias) = -1.75 × 2^(124 - 127) = -0.21875

Converting a binary value to single precision:
101011011.101 × 2^0 = 1.01011011101 × 2^8
3. The bits to the right of the binary point comprise the fractional
field f.
4. The number of times you shifted gives the exponent. The field e
should contain: exponent + 127.
5. Sign bit: 0 if positive, 1 if negative.
Example
❑ What is the single-precision representation of 639.6875
639.6875 = 1001111111.1011 (binary)
= 1.0011111111011×29
s=0
e = 9 + 127 = 136 = 10001000
f = 0011111111011
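The worked example can be checked by unpacking the bit pattern with Python's struct module; this is a verification sketch, not part of the original slide.

```python
import struct

# Reinterpret the 32-bit single-precision pattern of 639.6875 as an integer
bits = struct.unpack(">I", struct.pack(">f", 639.6875))[0]

s = bits >> 31            # sign bit
e = (bits >> 23) & 0xFF   # 8-bit biased exponent
f = bits & 0x7FFFFF       # 23-bit fraction field

print(s, format(e, "08b"), format(f, "023b"))
```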
Denormalized values
❑ When e = 00000000, the value is denormalized:
number = (-1)^S × 2^-126 × (0.F)
❑ The smallest positive denormalized value is
0 00000000 00000000000000000000001 = 2^-126 × 2^-23 = 2^-149
[Number line: positive values run from the smallest denormal 2^-149 through
the largest denormal 2^-126 × (1 - 2^-23), then the normalized range from
2^-126 up to 2^127 × (2 - 2^-23); values below 2^-149 cause positive
underflow and values above the maximum cause positive overflow.]
In comparison
❑ The smallest and largest possible 32-bit integers in two's
complement are only -2^31 and 2^31 - 1.
❑ How can we represent so many more values in the IEEE 754
format, even though we use the same number of bits as regular
integers?
[Figure: representable floating-point values are spaced non-uniformly; the
gaps between neighbors double at 2, 4, 8, 16, ...]
❑ Thus, floating-point arithmetic has “issues”
– Small roundoff errors can accumulate with multiplications or
exponentiations, resulting in big errors.
– Rounding errors can invalidate many basic arithmetic
principles such as the associative law, (x + y) + z = x + (y + z).
❑ The IEEE 754 standard guarantees that all machines will produce
the same results—but those results may not be mathematically
accurate!
Limits of the IEEE
representation
❑ Even some integers cannot be represented in the IEEE
format.
#include <stdio.h>
int main(void) {
    int x = 33554431;
    float y = 33554431;   /* 2^25 - 1 needs 25 significant bits */
    printf( "%d\n", x );  /* prints 33554431 */
    printf( "%f\n", y );  /* prints 33554432.000000 */
    return 0;
}
0.10 (decimal) = 0.0001100110011... (binary, repeating)
❑ During the Gulf War in 1991, a U.S. Patriot missile failed to intercept
an Iraqi Scud missile, and 28 Americans were killed.
❑ A later study determined that the problem was caused by the
inaccuracy of the binary representation of 0.10.
– The Patriot incremented a counter once every 0.10 seconds.
– It multiplied the counter value by 0.10 to compute the actual
time.
❑ However, the (24-bit) binary representation of 0.10 actually
corresponds to 0.099999904632568359375, which is off by
0.000000095367431640625.
❑ This doesn’t seem like much, but after 100 hours the time ends up
being off by 0.34 seconds—enough time for a Scud to travel 500
meters!
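The accumulation is easy to reproduce with the numbers quoted above:

```python
err_per_tick = 0.000000095367431640625   # error in the 24-bit value of 0.10
ticks = 100 * 60 * 60 * 10               # one tick per 0.10 s, for 100 hours
drift = err_per_tick * ticks             # accumulated timing error in seconds
print(drift)                             # about 0.34 seconds
```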
❑ Professor Skeel wrote a short article about this.
Roundoff Error and the Patriot Missile. SIAM News, 25(4):11, July 1992.
Floating-point addition
example
❑ To get a feel for floating-point operations, we’ll do an addition
example.
– To keep it simple, we’ll use base 10 scientific notation.
– Assume the mantissa has four digits, and the exponent
has one digit.
❑ An example for the addition: 9.999 × 10^1 + 1.610 × 10^-1

Steps 1-2: the actual addition
1. Equalize the exponents.
The operand with the smaller exponent should be rewritten by
increasing its exponent and shifting the point leftwards:
1.610 × 10^-1 = 0.016 × 10^1 (keeping four mantissa digits)
2. Add the mantissas.
  9.999 × 10^1
+ 0.016 × 10^1
= 10.015 × 10^1
Steps 3-5: representing the result
3. Normalize the result if necessary.
This step may cause the point to shift either left or right, and the
exponent to either increase or decrease.
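The steps above can be sketched with four-digit decimal mantissas stored as integers, so that (9999, 1) stands for 9.999 × 10^1. This is a minimal model: digits dropped during alignment are truncated, and the final digit drop rounds.

```python
def fp_add(a, b, digits=4):
    """Decimal floating-point add: operands are (mantissa, exponent)
    pairs where the mantissa is an integer with `digits` digits,
    e.g. (9999, 1) means 9.999 x 10^1."""
    (ma, ea), (mb, eb) = a, b
    if ea < eb:                        # step 1: equalize exponents
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    mb //= 10 ** (ea - eb)             # shift smaller mantissa right (truncating)
    m, e = ma + mb, ea                 # step 2: add the mantissas
    while m >= 10 ** digits:           # steps 3-4: normalize and round
        m = (m + 5) // 10              # drop the lowest digit, rounding
        e += 1
    return m, e

# 9.999 x 10^1 + 1.610 x 10^-1  ->  1.002 x 10^2
print(fp_add((9999, 1), (1610, -1)))  # (1002, 2)
```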
Example
❑Calculate 0 1000 0001 110…0 plus 0 1000 0010 00110..0
both are single-precision IEEE 754 representation
Multiplication
❑ To multiply two floating-point values, first multiply their magnitudes
and add their exponents.
  9.999 × 10^1
× 1.610 × 10^-1
= 16.098 × 10^0
❑ You can then round and normalize the result, yielding 1.610 * 101.
❑ The sign of the product is the exclusive-or of the signs of the
operands.
– If two numbers have the same sign, their product is positive.
– If two numbers have different signs, the product is negative.
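A sketch of the rule, including the exclusive-or sign rule, using (sign, mantissa, exponent) triples where 1 ≤ mantissa < 10:

```python
def fp_mul(a, b):
    """Multiply two decimal floating-point values given as
    (sign, mantissa, exponent) with 1 <= mantissa < 10."""
    (sa, ma, ea), (sb, mb, eb) = a, b
    s = sa ^ sb                # sign of product: XOR of operand signs
    m = ma * mb                # multiply the magnitudes
    e = ea + eb                # add the exponents
    if m >= 10:                # normalize
        m /= 10
        e += 1
    return s, round(m, 3), e   # round to four significant digits

# (9.999 x 10^1) * (-1.610 x 10^-1) -> -1.610 x 10^1
print(fp_mul((0, 9.999, 1), (1, 1.610, -1)))  # (1, 1.61, 1)
```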
The history of floating-point
computation
❑ In the past, each machine had its own implementation of
floating-point arithmetic hardware and/or software.
– It was impossible to write portable programs that would
produce the same results on different systems.
❑ It wasn’t until 1985 that the IEEE 754 standard was adopted.
– Having a standard at least ensures that all compliant
machines will produce the same outputs for the same
program.
Floating-point hardware
❑ When floating point was introduced in microprocessors, there
weren't enough transistors on chip to implement it.
– You had to buy a floating point co-processor (e.g., the
Intel 8087)
❑ As a result, many ISA’s use separate registers for floating
point.
❑ Modern transistor budgets enable floating point to be on chip.
– Intel’s 486 was the first x86 with built-in floating point
(1989)
❑ Even the newest ISA’s have separate register files for floating
point.
– Makes sense from a floor-planning perspective.
DIVISION IN BINARY
DIVISION HARDWARE
DIVISION FLOW CHART
Floating point operations
• Addition X+Y (adjusted Xm + Ym) 2Ye where Xe ≤ Ye
• Subtraction X-Y (adjusted Xm - Ym) 2Ye where Xe ≤ Ye
• Multiplication X x Y(adjusted Xm x Ym) 2Xe+Ye
• Division X/Y(adjusted Xm / Ym) 2Xe-Ye
Algorithm FP Addition/Subtraction
• Let X and Y be the FP numbers involved in
addition/subtraction, where Ye > Xe.
• Basic steps:
• Compute Ye - Xe, a fixed-point subtraction.
• Shift the mantissa Xm by (Ye - Xe) steps to the right to form
Xm × 2^-(Ye - Xe) if Xe is smaller than Ye; otherwise the
mantissa Ym is adjusted instead.
• Compute Xm × 2^-(Ye - Xe) ± Ym.
• Determine the sign of the result.
• Normalize the resulting value, if necessary
Multiplication and Division
Results in FP arithmetic
• FP arithmetic results will have to be produced in normalised form.
• Adjusting the bias of the resulting exponent is required. Biased
representation of exponent causes a problem when the exponents
are added in a multiplication or subtracted in the case of division,
resulting in a double biased or wrongly biased exponent. This must
be corrected. This is an extra step to be taken care of by FP
arithmetic hardware.
• When the result is zero, the resulting mantissa is all zeros but the
exponent is not. A special step is needed to make the exponent
bits zero.
• Overflow – is to be detected when the result is too large to be
represented in the FP format.
• Underflow – is to be detected when the result is too small to be
represented in the FP format. Overflow and underflow are
automatically detected by hardware, however, sometimes the
mantissa in such occurrence may remain in denormalised form.
• Handling the guard bits (which are extra bits) becomes an issue
when the result is to be rounded rather than truncated.
DECIMAL ADDER
What is a BCD Adder ?
• A decimal adder and a BCD adder are the same.
• A BCD adder, also known as a Binary-Coded Decimal
adder, is a digital circuit that performs addition
operations on Binary-Coded Decimal numbers. BCD is a
numerical representation that uses a four-bit binary
code to represent each decimal digit from 0 to 9. BCD
encoding allows for direct conversion between binary
and decimal representations, making it useful for
arithmetic operations on decimal numbers.
• The purpose of a BCD adder is to add two BCD
numbers together and produce a BCD result. It follows
specific rules to ensure accurate decimal results. The
BCD adder circuit typically consists of multiple stages,
each representing one decimal digit, and utilizes binary
addition circuits combined with BCD-specific rules.
Working of Decimal Adder
• We take a 4-bit Binary-Adder, which takes addend and
augend bits as an input with an input carry 'Carry in'.
• The Binary-Adder produces five outputs, i.e., Z8, Z4, Z2, Z1,
and an output carry K.
• With the help of the output carry K and Z8, Z4, Z2, Z1
outputs, the logical circuit is designed to identify the Cout
• Cout = K + Z8*Z4 + Z8*Z2
• The Z8, Z4, Z2, and Z1 outputs of the binary adder are
passed into the 2nd 4-bit binary adder as an Augend.
• The addend input of the 2nd 4-bit binary adder is designed in
such a way that the 1st and the 4th bits of the addend
number are 0 and the 2nd and the 3rd bits are the same as
Cout. When the value of Cout is 0, the addend number will be
0000, which produces the same result as the 1st 4-bit binary
adder. But when the value of Cout is 1, the addend will be
0110, i.e., 6, which is added to the augend to get the
valid BCD number.
• Example: 1001+1000
• First, add both numbers using a 4-bit binary adder with the input
carry set to 0.
• The binary adder produces the result 0001 and the carry output K = 1.
• Then, find the Cout value to determine whether the produced BCD is
valid, using the expression Cout = K + Z8·Z4 + Z8·Z2.
K=1
Z8 = 0
Z4 = 0
Z2 = 0
Cout = 1+0*0+0*0
Cout = 1+0+0
Cout = 1
• The value of Cout is 1, which indicates that the produced BCD code
is invalid. So, add the output of the 1st 4-bit binary adder to 0110.
= 0001 + 0110
= 0111
• Together with the carry output, the BCD result is:
BCD = Cout S8 S4 S2 S1 = 1 0111, i.e., decimal 17
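The single-digit correction above can be sketched in Python. This is an illustrative model of the two-adder circuit, not hardware; the function name is my own.

```python
def bcd_add_digit(a, b, carry_in=0):
    """Add two BCD digits (0-9) the way the hardware does it:
    a 4-bit binary add, then a +6 correction when the output
    carry logic Cout = K + Z8.Z4 + Z8.Z2 signals an invalid BCD sum."""
    assert 0 <= a <= 9 and 0 <= b <= 9
    z = a + b + carry_in               # output of the first 4-bit binary adder
    k = 1 if z > 15 else 0             # carry K out of the binary adder
    z &= 0xF                           # the four sum bits Z8 Z4 Z2 Z1
    z8, z4, z2 = (z >> 3) & 1, (z >> 2) & 1, (z >> 1) & 1
    cout = k | (z8 & z4) | (z8 & z2)   # Cout = K + Z8.Z4 + Z8.Z2
    if cout:
        z = (z + 0b0110) & 0xF         # second adder adds 6 to correct
    return cout, z

# The slide's example: 1001 + 1000 (9 + 8)
print(bcd_add_digit(9, 8))             # -> (1, 7), i.e. BCD 1 0111 = 17
```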
Algorithm for Decimal Adder
• BCD Addition of Given Decimal Number
• BCD addition of a given decimal number involves performing
addition operations on the individual BCD digits of the number.
• Step 1: Convert the decimal number into BCD representation:
• Take each decimal digit and convert it into its BCD equivalent, which
is a four-bit binary code.
• For example, the decimal number 456 would be represented as
0100 0101 0110 in BCD.
• Step 2: Align the BCD numbers for addition:
Ensure that the BCD numbers to be added have the same number
of digits.
If necessary, pad the shorter BCD number with leading zeros to
match the length of the longer BCD number.
• Step 3: Perform binary addition on the BCD digits:
• Start from the rightmost digit and add the corresponding BCD digits
of the two numbers.
• If the sum of the BCD digits is less than or equal to 9 (0000 to 1001
in binary), it represents a valid BCD digit.
• If the sum is greater than 9 (1010 to 1111 in binary), it indicates a
carry-out, and a correction is needed.
• Step 4: Handle carry-out and correction:
• When a carry-out occurs, it needs to be propagated to the next
higher-order digit. Add the carry-out to the next higher-order digit's
BCD digit and continue the process until all digits have been
processed.Step 5: Obtain the final BCD result:
• Once all the BCD digits have been processed, the resulting BCD
numbers represent the decimal sum of the original BCD numbers.
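Steps 1–5 above can be sketched as a digit-by-digit loop. This is an illustrative model of the algorithm, not a circuit; the function name is my own.

```python
def bcd_add(x, y):
    """Multi-digit BCD addition, one digit per step as in Steps 1-5.
    x and y are decimal integers; returns the BCD digit list of x + y."""
    dx = [int(d) for d in str(x)][::-1]    # least significant digit first
    dy = [int(d) for d in str(y)][::-1]
    # Step 2: pad the shorter number with leading zeros
    n = max(len(dx), len(dy))
    dx += [0] * (n - len(dx)); dy += [0] * (n - len(dy))
    carry, out = 0, []
    for a, b in zip(dx, dy):               # Step 3: rightmost digit first
        s = a + b + carry
        carry = 1 if s > 9 else 0          # Step 4: correct and carry out
        out.append(s - 10 if carry else s)
    if carry: out.append(1)
    return out[::-1]                       # Step 5: the final BCD digits

print(bcd_add(456, 789))   # -> [1, 2, 4, 5]   (456 + 789 = 1245)
```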
Decimal Arithmetic
Operation
Decimal Arithmetic Operation
• Incrementing or decrementing a register works the same way for
binary and BCD, except that a binary counter goes through 16 states,
from 0000 to 1111, while a BCD counter goes through 10 states,
from 0000 to 1001 and back to 0000.
• A decimal shift right or left is designated by the prefix d
(dshr, dshl) to indicate a shift over the four bits that hold
each decimal digit.
INPUT-OUTPUT ORGANIZATION
• Peripheral Devices
• Input-Output Interface
• Modes of Transfer
• Priority Interrupt
• Input-Output Processor
• Serial Communication
PERIPHERAL DEVICES
Input Devices
• Keyboard
• Optical input devices
  - Card Reader
  - Paper Tape Reader
  - Bar code reader
  - Digitizer
  - Optical Mark Reader
• Magnetic Input Devices
  - Magnetic Stripe Reader
• Screen Input Devices
  - Touch Screen
  - Light Pen
  - Mouse
• Analog Input Devices
Output Devices
• Card Puncher, Paper Tape Puncher
• CRT
• Printer (Impact, Ink Jet, Laser, Dot Matrix)
• Plotter
• Analog
• Voice
INPUT/OUTPUT INTERFACE
Provides a method for transferring information between internal storage (such as
memory and CPU registers) and external I/O devices
Resolves the differences between the computer and peripheral devices
Peripherals - Electromechanical Devices
CPU or Memory - Electronic Device
Unit of Information
Peripherals – Byte, Block, …
CPU or Memory – Word
[Figure: I/O bus connecting the processor, through interface units, to a keyboard and display terminal, a printer, a magnetic disk, and a magnetic tape]
Interface
- Decodes the device address (device code)
- Decodes the commands (operation)
- Provides signals for the peripheral controller
- Synchronizes the data flow and supervises
the transfer rate between peripheral and CPU or Memory
Typical I/O instruction format:
  Op. code | Device address | Function code (command)
[Figure: I/O interface unit — bidirectional data bus buffers connect the CPU data bus to Port A and Port B I/O data registers, plus control and status registers]
Handshaking
- A control signal is accompanied with each data
being transmitted to indicate the presence of data
- The receiving unit responds with another control
signal to acknowledge receipt of the data
Asynchronous Data Transfer
STROBE CONTROL
* Employs a single control line to time each transfer
* The strobe may be activated by either the source or
the destination unit
[Figure: source-initiated and destination-initiated strobe — a data bus plus a single strobe line between the source and destination units]
HANDSHAKING
[Figure: two-wire handshaking, source-initiated and destination-initiated — "data valid" and "data accepted" control lines; the timing diagrams show the data bus carrying valid data between the two control transitions]
Asynchronous serial transmission format (example character 1 1 0 0 0 1 0 1):
  Start bit (1 bit, low) | Character bits | Stop bits (at least 1 bit, high)
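The frame format can be sketched as a bit-list builder. This is an illustrative model, not a UART driver; the function name and the LSB-first ordering convention are assumptions for the example.

```python
def frame_byte(data, stop_bits=1):
    """Asynchronous serial framing as described above: the line idles
    at 1, a start bit of 0 marks the beginning of a character, the
    character bits follow (LSB first here), and at least one stop
    bit of 1 ends the frame."""
    bits = [0]                                   # start bit
    bits += [(data >> i) & 1 for i in range(8)]  # 8 character bits, LSB first
    bits += [1] * stop_bits                      # stop bit(s)
    return bits

print(frame_byte(0b01000001))   # -> [0, 1, 0, 0, 0, 0, 0, 1, 0, 1]
```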
[Figure: asynchronous communication interface — internal bus; transmitter register feeding a shift register for serial output; receiver shift register feeding a receiver register; control register; status register; timing and control with transmitter and receiver clocks; chip select (CS), register select (RS), I/O read (RD), and I/O write (WR) inputs]

Register selection:
  CS  RS  Oper.  Register selected
  0   x   x      None
  1   0   WR     Transmitter register
  1   1   WR     Control register
  1   0   RD     Receiver register
  1   1   RD     Status register
Transmitter Register
- Accepts a data byte(from CPU) through the data bus
- Transferred to a shift register for serial transmission
Receiver
- Receives serial information into another shift register
- Complete data byte is sent to the receiver register
Status Register Bits
- Used for I/O flags and for recording errors
Control Register Bits
- Define baud rate, no. of bits in each character, whether
to generate and check parity, and no. of stop bits
PRIORITY INTERRUPT
Priority
- Determines which interrupt is to be served first
when two or more requests are made simultaneously
- Also determines which interrupts are permitted to
interrupt the computer while another is being serviced
- Higher priority interrupts can make requests while
servicing a lower priority interrupt
[Figure: priority interrupt hardware — the interrupt register and mask register feed the priority logic; the gated interrupt signal goes to the CPU, and INTACK from the CPU enables the vector address (VAD) onto the bus]
IEN: set or cleared by the instructions ION and IOF
IST: indicates that an unmasked interrupt has occurred. INTACK enables
the tristate bus buffer to load the VAD generated by the priority logic.
Interrupt Register:
- Each bit is associated with an Interrupt Request from
different Interrupt Source - different priority level
- Each bit can be cleared by a program instruction
Mask Register:
- Mask Register is associated with Interrupt Register
- Each bit can be set or cleared by an Instruction
Priority Interrupt
Determines the highest-priority interrupt when more than one interrupt takes place

  Inputs           Outputs
  I0 I1 I2 I3      x  y  IST
  1  d  d  d       0  0  1
  0  1  d  d       0  1  1
  0  0  1  d       1  0  1
  0  0  0  1       1  1  1
  0  0  0  0       d  d  0

Boolean functions:
  x = I0' I1'
  y = I0' I1 + I0' I2'
  IST = I0 + I1 + I2 + I3
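The truth table and Boolean functions above can be checked with a short sketch. This is an illustrative model of the encoder, not hardware; the function name is my own.

```python
def priority_encode(i0, i1, i2, i3):
    """4-input priority encoder from the table above (I0 has the
    highest priority). Returns (x, y, IST) using the Boolean functions
    x = I0'I1',  y = I0'I1 + I0'I2',  IST = I0 + I1 + I2 + I3.
    When no input is active, x and y are don't-cares (here 1, 1)."""
    x = (1 - i0) & (1 - i1)
    y = ((1 - i0) & i1) | ((1 - i0) & (1 - i2))
    ist = i0 | i1 | i2 | i3
    return x, y, ist

print(priority_encode(0, 0, 1, 0))   # -> (1, 0, 1): vector address xy = 10
```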
INTERRUPT CYCLE
[Figure: CPU bus signals during DMA — on Bus Request (BR) the CPU asserts Bus Grant (BG) and places the address bus (ABUS), data bus (DBUS), RD and WR lines in a high-impedance (disabled) state; the DMA controller then exchanges DMA request / DMA acknowledge signals with the I/O device]
Direct Memory Access
Input
[1] Input Device <- R (read control signal)
[2] Buffer (DMA controller) <- input byte; the controller assembles
    bytes into a word until the word is full
[3] M[Address Reg] <- assembled word, with W (write control signal)
[4] Address Reg <- Address Reg + 1; WC (word counter) <- WC - 1
[5] If WC = 0, interrupt the CPU to signal completion; else go to [1]
Output
[1] M[Address Reg] -> word, with R (read control signal);
    Address Reg <- Address Reg + 1; WC <- WC - 1
[2] Disassemble the word into bytes
[3] Buffer <- one byte; Output Device <- byte, with W, for each
    disassembled byte
[4] If WC = 0, interrupt the CPU to signal completion; else go to [1]
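The input sequence above can be sketched as a loop. This is an illustrative software model of the controller's behavior, not a device driver; the function name, the dictionary standing in for memory, and the 2-bytes-per-word choice are assumptions for the example.

```python
def dma_input(read_byte, memory, start_addr, word_count, bytes_per_word=2):
    """Sketch of the DMA input sequence: bytes from the device are
    assembled into words in the controller's buffer; each full word is
    written to memory, the address register is incremented and the word
    counter decremented until WC reaches 0 (then the CPU is interrupted)."""
    addr, wc = start_addr, word_count      # address register and word counter
    while wc > 0:
        word = 0
        for i in range(bytes_per_word):    # steps [1]-[2]: assemble a word
            word |= read_byte() << (8 * i)
        memory[addr] = word                # step [3]: write the word to memory
        addr += 1                          # step [4]: AR <- AR+1, WC <- WC-1
        wc -= 1
    return "interrupt"                     # step [5]: signal completion

data = iter([0x34, 0x12, 0x78, 0x56])
mem = {}
dma_input(lambda: next(data), mem, 0x100, 2)
print(mem)   # -> {256: 4660, 257: 22136}, i.e. words 0x1234 and 0x5678
```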
CYCLE STEALING
While DMA I/O takes place, the CPU continues executing instructions.
The DMA controller and the CPU both access memory -> memory access conflict.
Cycle steal: the DMA controller takes one memory cycle at a time;
the CPU is delayed for that cycle but is otherwise unaffected.
DMA TRANSFER
[Figure: DMA transfer in a computer system — the CPU, the random-access memory (RAM), and the DMA controller share the data bus, address bus, and read/write control lines; the CPU exchanges BR/BG with the controller; the controller's DS (DMA select), RS (register select), and address-select inputs are decoded from the bus; the controller exchanges DMA request / DMA acknowledge with the peripheral device and can interrupt the CPU]
MEMORY ORGANIZATION
• Memory Hierarchy
• Main Memory
• Auxiliary Memory
• Associative Memory
• Cache Memory
• Virtual Memory
Memory hierarchy (fastest, smallest, most expensive at the top):
  Registers -> Cache memory -> Main memory -> Magnetic disk -> Magnetic tape
MAIN MEMORY
RAM and ROM Chips

Typical RAM chip: a 128 x 8 RAM with two chip-select inputs (CS1 and CS2),
RD and WR controls, a 7-bit address (AD7), and an 8-bit bidirectional
data bus.

Typical ROM chip: a 512 x 8 ROM with the same chip-select inputs, a
9-bit address (AD9), and an 8-bit data output (no WR input).

[Figure: memory connection to the CPU — a decoder on the high-order
address lines selects one of four 128 x 8 RAM chips (RAM 1 to RAM 4)
or the 512 x 8 ROM; address lines 1-7 go to each RAM chip, lines 1-9
to the ROM]
AUXILIARY MEMORY
Information Organization on Magnetic Tapes
[Figure: a file (file i) is recorded along a track as a sequence of blocks (block 1, block 2, block 3, ...) separated by inter-record gaps (IRG) and terminated by an EOF mark; each block holds several records (R1, R2, ..., R6)]
ASSOCIATIVE MEMORY
- Accessed by the content of the data rather than by an address
- Also called Content Addressable Memory (CAM)
Hardware Organization
[Figure: organization of CAM — an argument register A (bits A1 ... Aj ... An) and a key register K (bits K1 ... Kj ... Kn) drive the match logic of every word; each cell Fij has read/write inputs (R, S), and the match logic of word i sets bit Mi of the match register]

MATCH LOGIC
[Figure: match logic for one word — each argument bit Aj is compared with the stored bit Fij only where the key bit Kj is 1; the comparison outputs are ANDed to produce Mi]
CACHE MEMORY
Locality of Reference
- The references to memory at any given time interval tend to be
confined to a few localized areas
- These areas contain a set of information, and the membership
changes gradually as time goes by
- Temporal Locality
The information in use now is likely to be used again in the near
future (e.g., reuse of instructions and data in loops)
- Spatial Locality
If a word is accessed, adjacent (nearby) words are likely to be
accessed soon (e.g., related data items such as arrays are usually
stored together; instructions are executed sequentially)
Cache
- The property of Locality of Reference makes the
Cache memory systems work
- Cache is a fast, small-capacity memory that should hold the
information most likely to be accessed
[Figure: the CPU accesses the cache memory, which is backed by the main memory]
PERFORMANCE OF CACHE
Memory Access
All memory accesses are directed first to the cache:
- If the word is in the cache (a hit), the cache provides it to the CPU
- If the word is not in the cache (a miss), a block (or line) containing
  that word is brought in to replace a block now in the cache

Effective access time: Te = Tc + (1 - h) Tm
where Tc is the cache access time, Tm the main-memory access time,
and h the hit ratio.
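The formula Te = Tc + (1 - h) Tm can be evaluated directly. A minimal sketch; the function name and the example timing values are assumptions, not figures from the slides.

```python
def effective_access_time(tc, tm, hit_ratio):
    """Te = Tc + (1 - h) Tm: every access goes to the cache first (Tc);
    on a miss (probability 1 - h), the main-memory time Tm is added."""
    return tc + (1 - hit_ratio) * tm

# e.g. 10 ns cache, 100 ns main memory, 95% hit ratio
print(round(effective_access_time(10, 100, 0.95), 2))   # -> 15.0 (ns)
```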
[Figure: associative-mapping cache — the CAM stores both the address and the data of each word, e.g., address 01000 with data 3450, address 02777 with data 6710, address 22235 with data 1234; the argument register holds the CPU address to be matched]
Direct-mapping example:
- Main memory: 32K x 12 (addresses 00000 to 77777 octal),
  address = 15 bits, data = 12 bits
- Cache memory: 512 x 12 (addresses 000 to 777 octal),
  address = 9 bits, data = 12 bits
- The 15-bit CPU address is divided into a 9-bit index (the cache
  address) and a 6-bit tag stored alongside the data; e.g., main-memory
  address 02777 (data 6710) maps to cache index 777 with tag 02
  (other example words: 01777 -> 4560, 02000 -> 5670).
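The tag/index split in the example above is a simple bit-field operation. An illustrative sketch; the function name is my own.

```python
def direct_map(addr, index_bits=9):
    """Direct-mapping address split: a 15-bit main-memory address is
    divided into a 9-bit index (the cache word address, 512 words)
    and a 6-bit tag stored with the data."""
    index = addr & ((1 << index_bits) - 1)   # low 9 bits -> cache index
    tag = addr >> index_bits                 # high 6 bits -> tag
    return tag, index

# octal address 02777 from the slide's example
print(direct_map(0o02777))   # -> (2, 511): tag 02, index 777 (octal)
```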
Thank You
22CS2112: COMPUTER ORGANIZATION
AND ARCHITECTURE
UNIT - V
REDUCED INSTRUCTION SET COMPUTER, PIPELINE
AND VECTOR PROCESSING AND MULTI PROCESSORS
By
Dr. G. Bhaskar
Associate Professor
Dept. of ECE
Email Id:bhaskar.0416@gmail.com
Reduced Instruction Set Computer:
CISC Characteristics
RISC Characteristics.
Multi Processors:
Cache Coherence.
What is CISC?
Complex Instruction Set Computer (CISC) Characteristics
Major characteristics of a CISC architecture
• A large number of instructions - typically from 100 to 250
instructions
• Some instructions that perform specialized tasks and are used
infrequently
• A large variety of addressing modes - typically from 5 to 20
different modes
• Variable-length instruction formats
• Instructions that manipulate operands in memory (whereas RISC
manipulates operands in registers)
What is RISC?
Disadvantages of CISC
• Slower execution
• More complex design
• Higher power consumption
Advantages of RISC:
• Simpler instructions
• Faster execution
• Lower power consumption
Disadvantages of RISC
• More instructions required
• Increased memory usage
• Higher cost
RISC vs CISC
• RISC focuses on software; CISC focuses on hardware.
• RISC uses only a hardwired control unit; CISC uses both hardwired and
microprogrammed control units.
• In RISC, transistors are used for more registers; in CISC, transistors
are used for storing complex instructions.
• RISC has fixed-size instructions; CISC has variable-size instructions.
• RISC performs arithmetic operations only register to register; CISC can
operate REG to REG, REG to MEM, or MEM to MEM.
• RISC requires more registers; CISC requires fewer registers.
• RISC code size is large; CISC code size is small.
• A RISC instruction executes in a single clock cycle; a CISC instruction
takes more than one clock cycle.
• A RISC instruction fits in one word; CISC instructions can be larger
than one word.
• RISC has simple, limited addressing modes; CISC has complex and more
numerous addressing modes.
• RISC stands for Reduced Instruction Set Computer; CISC stands for
Complex Instruction Set Computer.
• RISC has fewer instructions than CISC; CISC has more instructions
than RISC.
• RISC consumes less power; CISC consumes more power.
• RISC is highly pipelined; CISC is less pipelined.
• RISC requires more RAM; CISC requires less RAM.
CISC vs RISC
• CISC is short for Complex Instruction Set Computer; RISC stands for
Reduced Instruction Set Computer.
• CISC has a large number of instructions in the architecture; RISC has
very few, generally fewer than 100.
• The CISC architecture processes complex instructions that require
several clock cycles for execution - on average two to five clock
cycles per instruction (CPI); the RISC architecture executes simple
yet optimized instructions in a single clock cycle, at an average of
about 1.5 CPI.
• CISC uses variable-length instruction encodings; RISC uses
fixed-length encodings.
• CISC instructions require more execution time; RISC instructions
require less time for execution.
• Examples of CISC processors: Intel x86 CPUs, System/360, VAX, PDP-11,
the Motorola 68000 family, and AMD. Examples of RISC processors:
Alpha, ARC, ARM, AVR, MIPS, PA-RISC, PIC, Power Architecture,
and SPARC.
• CISC provides little support for parallelism and pipelining, so CISC
instructions are less pipelined; RISC processors are designed for
instruction pipelining.
Parallel Processing
C = A1B1 + A2B2 + A3B3 + ... + AkBk

split into four partial sums that can be computed concurrently:
C = (A1B1 + A5B5 + A9B9  + A13B13 + ...)
  + (A2B2 + A6B6 + A10B10 + A14B14 + ...)
  + (A3B3 + A7B7 + A11B11 + A15B15 + ...)
  + (A4B4 + A8B8 + A12B12 + A16B16 + ...)
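The four-way split above can be sketched in software. This is an illustrative serial model of the idea (element i goes to unit i mod 4), not real parallel hardware; the function name is my own.

```python
def parallel_inner_product(A, B, units=4):
    """Split the sum C = A1B1 + ... + AkBk into `units` interleaved
    partial sums, as a machine with four multiply-add units would;
    the partial results are combined at the end."""
    partials = [0.0] * units
    for i in range(len(A)):          # element i goes to unit i mod 4
        partials[i % units] += A[i] * B[i]
    return sum(partials)             # combine the partial sums

A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [1, 1, 1, 1, 1, 1, 1, 1]
print(parallel_inner_product(A, B))   # -> 36.0
```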
VECTOR INSTRUCTION FORMAT
MULTIPLE MEMORY MODULE AND INTERLEAVING
ARRAY PROCESSOR
attached array processor with host computer
SIMD array processor Organization
Pipelining and Vector Processing 1
• Parallel Processing
• Pipelining
• Arithmetic Pipeline
• Instruction Pipeline
• RISC Pipeline
• Vector Processing
• Array Processors
PARALLEL PROCESSING
- Inter-Instruction level
- Intra-Instruction level
PARALLEL COMPUTERS
Architectural Classification
– Flynn's classification
» Based on the multiplicity of Instruction Streams and
Data Streams
» Instruction Stream
• Sequence of Instructions read from memory
» Data Stream
• Operations performed on the data in the processor
[Figure: classification tree of parallel computers — entries include VLIW; MISD (nonexistent); systolic arrays; dataflow; associative processors; and message-passing multicomputers with hypercube, mesh, and reconfigurable topologies]
SISD COMPUTER SYSTEM
[Figure: control unit, processor unit, and memory unit connected by a single instruction stream and a single data stream]
Characteristics
- Standard von Neumann machine
- Instructions and data are stored in memory
- One operation at a time
Limitations
Von Neumann bottleneck
• Multiprogramming
• Spooling
• Multifunction processor
• Pipelining
• Exploiting instruction-level parallelism
- Superscalar
- Superpipelining
- VLIW (Very Long Instruction Word)
MISD COMPUTER SYSTEM
[Figure: multiple control units (CU) and processors (P), each with its own instruction stream from memory (M), all operating on a single data stream]
Characteristics
- There is no computer at present that can be
classified as MISD
SIMD COMPUTER SYSTEM
[Figure: one control unit broadcasts a single instruction stream to multiple processing elements; data streams flow between the PEs and the memory modules through an alignment network]
Characteristics
- Only one copy of the program exists
- A single controller executes one instruction at a time
Array Processors
- The control unit broadcasts instructions to all PEs,
and all active PEs execute the same instructions
- ILLIAC IV, GF-11, Connection Machine, DAP, MPP
Systolic Arrays
Associative Processors
- Content addressing
- Data transformation operations over many sets
of arguments with a single instruction
- STARAN, PEPE
MIMD COMPUTER SYSTEM
[Figure: multiple processors and memory modules connected through an interconnection network]
Characteristics
- Multiple processing units operating concurrently
- Includes shared-memory multiprocessors and message-passing
  multicomputers
SHARED-MEMORY MULTIPROCESSORS
[Figure: processors P ... P connected to a shared memory through an interconnection network (IN) implemented with buses, a multistage IN, or a crossbar switch]
Characteristics
All processors have equally direct access to
one large memory address space
Example systems
Bus and cache-based systems
- Sequent Balance, Encore Multimax
Multistage IN-based systems
- Ultracomputer, Butterfly, RP3, HEP
Crossbar switch-based systems
- C.mmp, Alliant FX/8
Limitations
Memory access latency
Hot spot problem
MESSAGE-PASSING MULTICOMPUTER
[Figure: processors P ... P, each with its own local memory M, connected point-to-point through a message-passing network]
Characteristics
- Interconnected computers
- Each processor has its own memory, and processors communicate
via message passing
Example systems
- Tree structure: Teradata, DADO
- Mesh-connected: Rediflow, Series 2010, J-Machine
- Hypercube: Cosmic Cube, iPSC, NCUBE, FPS T Series, Mark III
Limitations
- Communication overhead
- Hard to program
PIPELINING
A technique of decomposing a sequential process
into suboperations, with each subprocess being
executed in a special dedicated segment that
operates concurrently with all other segments.
Ai * Bi + Ci for i = 1, 2, 3, ... , 7
[Figure: three-segment pipeline — Segment 1 loads Ai and Bi into registers R1 and R2 (Ci comes from memory); Segment 2 multiplies R1 * R2 into R3 and loads Ci into R4; Segment 3 adds R3 + R4 into R5]
GENERAL PIPELINE
General Structure of a 4-Segment Pipeline
[Figure: Input -> S1 -> R1 -> S2 -> R2 -> S3 -> R3 -> S4 -> R4, with a common clock driving all registers]
Space-Time Diagram (task Ti enters segment 1 at cycle i and advances one segment per cycle):
  Clock cycle:  1  2  3  4  5  6  7  8  9
  Segment 1:    T1 T2 T3 T4 T5 T6
  Segment 2:       T1 T2 T3 T4 T5 T6
  Segment 3:          T1 T2 T3 T4 T5 T6
  Segment 4:             T1 T2 T3 T4 T5 T6
PIPELINE SPEEDUP
n: number of tasks to be performed
tn: time to complete each task without a pipeline
tp: clock-cycle time of a k-segment pipeline

Speedup Sk:
  Sk = n*tn / ((k + n - 1)*tp)
  lim (n -> infinity) Sk = tn / tp   ( = k, if tn = k*tp )
Example: k = 4 segments, tp = 20 ns, n = 100 tasks (tn = k*tp = 80 ns)
Pipelined system:     (k + n - 1)*tp = (4 + 99) * 20 = 2060 ns
Non-pipelined system: n*tn = 100 * 80 = 8000 ns
Speedup:              Sk = 8000 / 2060 = 3.88
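The speedup calculation above can be reproduced directly from the formula. A minimal sketch; the function name is my own.

```python
def pipeline_speedup(k, n, tp):
    """Sk = n*tn / ((k + n - 1)*tp) with tn = k*tp: the pipeline takes
    k cycles to fill for the first task, then completes one task per
    clock cycle."""
    tn = k * tp                       # time for one task without a pipeline
    pipelined = (k + n - 1) * tp      # fill the pipe, then 1 task per cycle
    return (n * tn) / pipelined

# the slide's example: k = 4, tp = 20 ns, n = 100 tasks
print(round(pipeline_speedup(4, 100, 20), 2))   # -> 3.88
```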
ARITHMETIC PIPELINE
Floating-point adder pipeline for X = A x 2^a and Y = B x 2^b
(A and B are normalized fractions; a and b are the exponents):

S1: Compare the exponents (exponent subtractor, t = |a - b|), choose
    the larger exponent r = max(a, b), and align the mantissa of the
    smaller operand by shifting it right t places.
S2: Add (or subtract) the aligned fractions: c = A +/- B.
S3: Normalize the result: a leading-zero counter determines the shift,
    and a left shifter (or a one-place right shift on fraction
    overflow) produces the normalized fraction d.
S4: Adjust the exponent (exponent adder), giving the final result:
    C = A + B = c x 2^r = d x 2^s,  with 0.5 <= d < 1.

[Figure: four-segment pipeline with registers R between segments; the exponent and mantissa data paths flow through the stages in parallel]
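The four segments can be traced in software. This is an illustrative sketch that processes one operand pair at a time (no overlap), assumes normalized positive-magnitude fractions 0.5 <= |A|, |B| < 1, and uses a function name of my own.

```python
def fp_pipeline_add(A, a, B, b):
    """Trace the four pipeline segments for X = A*2^a + Y = B*2^b."""
    # S1: compare exponents, choose the larger, align the smaller fraction
    t = abs(a - b)
    r = max(a, b)
    if a < b: A = A / (2 ** t)        # right-shift the smaller operand
    else:     B = B / (2 ** t)
    # S2: add the aligned fractions
    c = A + B
    # S3: normalize the result (shift out the overflow / leading zeros)
    d, shift = c, 0
    while abs(d) >= 1:      d, shift = d / 2, shift + 1
    while 0 < abs(d) < 0.5: d, shift = d * 2, shift - 1
    # S4: adjust the exponent: C = c * 2^r = d * 2^(r + shift)
    return d, r + shift

print(fp_pipeline_add(0.5, 2, 0.5, 2))   # -> (0.5, 3): 2 + 2 = 4 = 0.5*2^3
```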
INSTRUCTION CYCLE
Six phases in an instruction cycle:
1 Fetch an instruction from memory
2 Decode the instruction
3 Calculate the effective address of the operand
4 Fetch the operands from memory
5 Execute the operation
6 Store the result in the proper place
INSTRUCTION PIPELINE
Four-segment instruction pipeline: FI (fetch instruction), DA (decode
and calculate effective address), FO (fetch operand), EX (execute).

Timing (each instruction advances one segment per cycle):
  i   : FI DA FO EX
  i+1 :    FI DA FO EX
  i+2 :       FI DA FO EX
[Flowchart: four-segment instruction pipeline — Segment 1 fetches the instruction from memory; Segment 2 decodes the instruction and calculates the effective address, then tests for a branch; Segment 3 fetches the operand from memory; Segment 4 executes the instruction, then checks for an interrupt and updates the PC (branch or interrupt handling) before the next fetch]

[Timing diagram: instructions 1-7 flow through FI DA FO EX one cycle apart; when instruction 3 is a branch, the FI of instruction 4 is repeated after the branch is resolved in EX, delaying instructions 4-7]
Data hazard example: for R1 <- R1 + 1, the new value of R1 is not
available when the next dependent instruction reaches its operand
stage, so a bubble (idle cycle) is inserted in the pipeline.

Control hazards
Branches and other instructions that change the PC delay the fetch
of the next instruction. For a JMP, the branch address is not loaded
into the PC until the instruction has been decoded, so the following
instruction's stages (IF ID OF OE OS) are pushed back by bubbles.
STRUCTURAL HAZARDS
Structural Hazards
Occur when some resource has not been
duplicated enough to allow all combinations
of instructions in the pipeline to execute
i+1 FI DA FO EX
i+2 stall stall FI DA FO EX
DATA HAZARDS
Data Hazards
Interlock
- hardware detects the data dependencies and delays the scheduling
of the dependent instruction by stalling enough clock cycles
Forwarding (bypassing, short-circuiting)
- Accomplished by a data path that routes a value from a source
(usually an ALU) to a user, bypassing a designated register. This
allows the value being produced to be used at an earlier stage in the
pipeline than would otherwise be possible
Software Technique
Instruction scheduling (by the compiler) for delayed load
FORWARDING HARDWARE
Example:
  ADD R1, R2, R3
  SUB R4, R1, R5
SUB needs R1 before ADD has written it back to the register file;
forwarding hardware routes the ALU result of ADD directly to the
ALU input of SUB.
INSTRUCTION SCHEDULING
a = b + c;
d = e - f;
Delayed Load
A load requiring that the following instruction not use its result
CONTROL HAZARDS
Branch Instructions
Prefetch Target Instruction
– Fetch instructions from both streams, branch not taken and branch taken
– Both are saved until the branch is executed. Then, select the right
instruction stream and discard the wrong one
Branch Target Buffer(BTB; Associative Memory)
– Entry: Addr of previously executed branches; Target instruction
and the next few instructions
– When fetching an instruction, search the BTB.
– If found, fetch the instruction stream from the BTB;
– If not, fetch the new stream and update the BTB
Loop Buffer (high-speed register file)
– Stores an entire loop so that it can be executed without accessing memory
Branch Prediction
– Guessing the branch condition, and fetch an instruction stream based on
the guess. Correct guess eliminates the branch penalty
Delayed Branch
– Compiler detects the branch and rearranges the instruction sequence
by inserting useful instructions that keep the pipeline busy
in the presence of a branch instruction
RISC PIPELINE
RISC
- A machine with a very fast clock cycle that executes at the rate
  of one instruction per cycle
  <- simple instruction set
     fixed-length instruction format
     register-to-register operations

Instruction cycle of a three-stage instruction pipeline
Data manipulation instructions:
  I: Instruction fetch
  A: Decode, read registers, ALU operation
  E: Write a register
DELAYED LOAD
LOAD:  R1 <- M[address 1]
LOAD:  R2 <- M[address 2]
ADD:   R3 <- R1 + R2
STORE: M[address 3] <- R3
Three-segment pipeline timing with data conflict:
  clock cycle:  1  2  3  4  5  6
  Load R1:      I  A  E
  Load R2:         I  A  E
  Add R1+R2:          I  A  E      (conflict: R2 is not written until cycle 4)
  Store R3:              I  A  E

Pipeline timing with delayed load (the data dependency is taken care
of by the compiler rather than the hardware):
  clock cycle:  1  2  3  4  5  6  7
  Load R1:      I  A  E
  Load R2:         I  A  E
  NOP:                I  A  E
  Add R1+R2:             I  A  E
  Store R3:                 I  A  E
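The compiler's NOP insertion above can be sketched as a pass over the instruction list. An illustrative model only: the (op, dest, sources) tuple form is a simplified three-address representation I made up for the example, not a real ISA.

```python
def schedule_delayed_load(program):
    """Insert a NOP whenever an instruction uses a register loaded by
    the immediately preceding LOAD, filling the load delay slot the
    way the compiler does in the timing diagram above."""
    out = []
    for instr in program:
        op, dest, srcs = instr
        if out:
            prev_op, prev_dest, _ = out[-1]
            if prev_op == "LOAD" and prev_dest in srcs:
                out.append(("NOP", None, ()))   # fill the load delay slot
        out.append(instr)
    return out

prog = [("LOAD", "R1", ()), ("LOAD", "R2", ()),
        ("ADD", "R3", ("R1", "R2")), ("STORE", None, ("R3",))]
for instr in schedule_delayed_load(prog):
    print(instr)                                # NOP appears before the ADD
```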
DELAYED BRANCH
Compiler analyzes the instructions before and after
the branch and rearranges the program sequence by
inserting useful instructions in the delay steps
VECTOR PROCESSING
• A vector processor is an ensemble of hardware resources, including
vector registers, functional pipelines, processing elements, and
register counters, for performing vector operations.
VECTOR PROGRAMMING
DO 20 I = 1, 100
20 C(I) = B(I) + A(I)

Conventional computer:
  Initialize I = 0
  20 Read A(I)
     Read B(I)
     Store C(I) = A(I) + B(I)
     Increment I = I + 1
     If I <= 100 goto 20

Vector computer:
  C(1:100) = A(1:100) + B(1:100)
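The single vector statement above can be mimicked in software. This is only an illustrative stand-in: a plain loop plays the role of the adder pipeline, and the function name is my own.

```python
def vector_add(A, B):
    """The vector instruction C(1:100) = A(1:100) + B(1:100): one
    operation specifies whole operand vectors, and every element pair
    streams through the adder (here, a comprehension standing in for
    the hardware pipeline)."""
    return [a + b for a, b in zip(A, B)]

A = list(range(1, 101))   # A(1:100) = 1, 2, ..., 100
B = [100] * 100
C = vector_add(A, B)
print(C[0], C[99])        # -> 101 200
```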
VECTOR INSTRUCTIONS
f1: V -> V      (vector-vector instruction)
f2: V -> S      (vector reduction instruction)     V: vector operand
f3: V x V -> V  (vector-vector instruction)        S: scalar operand
f4: V x S -> V  (vector-scalar instruction)
[Figure: pipeline for inner-product calculation — source vectors A and B feed a multiplier pipeline whose products feed an adder pipeline]
Matrix Multiplication
[ A11 A12 A13 ]   [ B11 B12 B13 ]   [ C11 C12 C13 ]
[ A21 A22 A23 ] x [ B21 B22 B23 ] = [ C21 C22 C23 ]
[ A31 A32 A33 ]   [ B31 B32 B33 ]   [ C31 C32 C33 ]

In general, Cij = Ai1*B1j + Ai2*B2j + Ai3*B3j

Address Interleaving
[Figure: multiple memory modules, each with its own address register (AR) and data register (DR), connected to common address and data buses]
1. A multiprocessor system is an interconnection of two or more CPUs with memory and input-output
equipment.
2. Multiprocessor systems are classified as multiple instruction stream, multiple data stream (MIMD)
systems.
3. A distinction exists between multiprocessors and multicomputers, even though both support
concurrent operations.
4. In multicomputers several autonomous computers are connected through a network and they may
or may not communicate but in a multiprocessor system there is a single OS Control that provides
interaction between processors and all the components of the system to cooperate in the solution of
the problem.
5. VLSI circuit technology has reduced the cost of computers to such a low level that the concept
of applying multiple processors to meet system performance requirements has become an attractive
design possibility.
Fig. 5.2 Taxonomy of mono- and multiprocessor organizations
Characteristics of Multiprocessors:
Benefits of Multiprocessing:
1. Multiprocessing increases the reliability of the system so that a failure or error in one part has
limited effect on the rest of the system. If a fault causes one processor to fail, a second
processor can be assigned to perform the functions of the disabled one.
2. Improved System performance. System derives high performance from the fact that
computations can proceed in parallel in one of the two ways:
• Multiple independent jobs can be made to operate in parallel.
• A single job can be partitioned into multiple parallel tasks. This can be achieved in two
ways:
- The user explicitly declares that the tasks of the program be executed in
parallel
- The compiler provided with the multiprocessor software can automatically detect parallelism in a
program (it checks for data dependency)
COUPLING OF PROCESSORS
Tightly Coupled System/Shared Memory:
- Tasks and/or processors communicate in a highly synchronized fashion
- Communicates through a common global shared memory
- Shared memory system. This doesn’t preclude each processor from having its own local memory(cache memory)
Loosely Coupled System/Distributed Memory
- Tasks or processors do not communicate in a synchronized fashion.
- Communicates by message passing packets consisting of an address, the data content, and some error detection code.
- Overhead for data exchange is high
- Distributed memory system
Loosely coupled systems are more efficient when the interaction between tasks is minimal, whereas
tightly coupled systems can tolerate a higher degree of interaction between tasks.
Shared (Global) Memory
- A global memory space accessible by all processors
- Processors may also have some local memory

Distributed (Local, Message-Passing) Memory
- All memory units are associated with processors
- To retrieve information from another processor's memory, a message must be sent there

Uniform Memory
- All processors take the same time to reach all memory locations

Non-uniform (NUMA) Memory
- Memory access is not uniform

Fig. 5.3 Shared and distributed memory
- Each switch point has control logic to set up the transfer path between a processor and a
memory module.
- It also resolves multiple requests for access to the same memory on a predetermined
priority basis.
- This organization supports simultaneous transfers from all memory modules because there is a
separate path associated with each module.
- The hardware required to implement the switch can become quite large and complex.
Fig. 5.8 (a) Crossbar switch (b) Block diagram of crossbar switch
Advantage:
- Supports simultaneous transfers from all memory modules Disadvantage:
- The hardware required to implement the switch can become quite large and complex.
d. Multistage Switching Network:
- The basic component of a multi stage switching network is a two-input, two- output interchange switch.
Fig. 5.9 operation of 2X2 interconnection switch
Using the 2x2 switch as a building block, it is possible to build a multistage network to control the
communication between a number of sources and destinations.
- To see how this is done, consider the binary tree shown in Fig. below.
- Certain request patterns cannot be satisfied simultaneously; e.g., if P1 is connected to a
destination in 000-011, then P2 can only reach destinations in 100-111.
- A transfer proceeds in three steps: set up the path, transfer the address into memory, transfer the data.
- In a loosely coupled multiprocessor system, both the source and destination are processing elements.
e. Hypercube System:
- A routing procedure can be developed by computing the exclusive-OR of the source node address with the
destination node address.
- The message is then sent along any one of the axes for which the resulting binary value has a 1 bit;
these 1 bits mark the axes on which the two node addresses differ.
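The XOR routing procedure can be sketched in a few lines. An illustrative model for a 3-cube that visits the differing axes in a fixed low-to-high order (real routers may choose any order); the function name is my own.

```python
def hypercube_route(src, dst, n=3):
    """XOR routing: R = src XOR dst; the message is forwarded along
    each axis where R has a 1 bit (the dimensions in which the two
    node addresses differ), flipping one address bit per hop."""
    path, node = [src], src
    diff = src ^ dst
    for axis in range(n):
        if (diff >> axis) & 1:       # nodes differ in this dimension
            node ^= 1 << axis        # flip that address bit = one hop
            path.append(node)
    return path

# route from node 010 to node 101 in a 3-cube
print([format(x, "03b") for x in hypercube_route(0b010, 0b101)])
# -> ['010', '011', '001', '101']
```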
3. Inter-processor Arbitration
- Only one of CPU, IOP, and Memory can be granted to use the bus at a time
- Arbitration mechanism is needed to handle multiple requests to the shared resources to resolve multiple
contention
- SYSTEM BUS:
• A bus that connects the major components such as CPU’s, IOP’s and memory
• A typical System bus consists of 100 signal lines divided into three functional groups: data, address and
control lines. In addition there are power distribution lines to the components.
- Synchronous Bus
• Each data item is transferred over a time slice known to both the
source and destination units
• A common clock source, or separate clocks with a synchronization
signal transmitted periodically, keeps the clocks in the system
synchronized
- Asynchronous Bus
• Each data item is transferred by a handshake mechanism:
  - The unit that transmits the data sends a control signal indicating the presence of data
  - The unit receiving the data responds with another control signal to acknowledge receipt
• Strobe pulse – supplied by one of the units to indicate to the other unit when the data transfer has to occur
Table 5.1 IEEE standard 796 multibus signals