Unit-5 Computer Organization Notes
S.No. Name of the Topic
UNIT-V: PIPELINE AND VECTOR PROCESSING, MULTIPROCESSORS, MULTICORE
COMPUTERS
1. Pipeline and Vector Processing: Parallel processing
2. Pipelining
3. Instruction pipeline
4. Vector processing
5. Array processors
6. Multiprocessors: Characteristics of multiprocessors
7. Interconnection structures
8. Interprocessor arbitration
9. Multicore Computers: Hardware performance issues
Parallel Processing
Parallel processing is a class of techniques that enables a system to carry out simultaneous data-processing tasks, increasing the computational speed of the computer.
A parallel processing system can carry out simultaneous data-processing to achieve faster execution time. For
instance, while an instruction is being processed in the ALU component of the CPU, the next instruction can be
read from memory.
The primary purpose of parallel processing is to enhance the computer processing capability and increase its
throughput, i.e. the amount of processing that can be accomplished during a given interval of time.
A parallel processing system can be achieved by having a multiplicity of functional units that perform identical or
different operations simultaneously. The data can be distributed among various multiple functional units.
The following diagram shows one possible way of separating the execution unit into eight functional units
operating in parallel.
The operation performed in each functional unit is indicated in each block of the diagram:
o The adder and integer multiplier perform arithmetic operations on integer numbers.
o The floating-point operations are separated into three circuits operating in parallel.
o The logic, shift, and increment operations can be performed concurrently on different data. All units are
independent of each other, so one number can be shifted while another number is being incremented.
Pipelining
The term Pipelining refers to a technique of decomposing a sequential process into sub-operations, with each sub-
operation being executed in a dedicated segment that operates concurrently with all other segments.
The most important characteristic of a pipeline technique is that several computations can be in progress in
distinct segments at the same time. The overlapping of computation is made possible by associating a register
with each segment in the pipeline. The registers provide isolation between each segment so that each can operate
on distinct data simultaneously.
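The benefit of this overlap can be quantified with the standard pipeline speedup estimate (a textbook result, not stated explicitly in these notes): a k-segment pipeline with clock period tp completes n tasks in (k + n − 1) cycles, versus n·tn for an equivalent non-pipelined unit. A minimal sketch, with illustrative numbers:

```python
# Estimated pipeline speedup: a k-segment pipeline with clock period tp
# completes n tasks in (k + n - 1) cycles; a non-pipelined unit needs
# n * tn, where tn is the time for one complete task.
def pipeline_speedup(k, n, tp, tn):
    pipelined_time = (k + n - 1) * tp
    nonpipelined_time = n * tn
    return nonpipelined_time / pipelined_time

# Example: 4 segments, 100 tasks, 20 ns clock, 80 ns per unpipelined task.
# As n grows, the speedup approaches tn / tp = 4.
print(pipeline_speedup(k=4, n=100, tp=20, tn=80))
```

The speedup never quite reaches k because the first (k − 1) cycles are spent filling the pipeline.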
The structure of a pipeline organization can be represented simply by including an input register for each segment
followed by a combinational circuit.
Let us consider an example of a combined multiplication and addition operation to get a better understanding of the
pipeline organization.
The combined multiplication and addition operation is performed on a stream of numbers such as:
Ai * Bi + Ci for i = 1, 2, 3, ...
The operation to be performed on the numbers is decomposed into sub-operations, with each sub-operation
implemented in a segment within a pipeline.
The sub-operations performed in each segment of the pipeline are defined as:
Segment 1: R1 ← Ai, R2 ← Bi (input Ai and Bi)
Segment 2: R3 ← R1 * R2, R4 ← Ci (multiply and input Ci)
Segment 3: R5 ← R3 + R4 (add Ci to the product)
The following block diagram represents the combined as well as the sub-operations performed in each segment of
the pipeline.
Registers R1, R2, R3, and R4 hold the data, and the combinational circuits operate in a particular segment.
The output generated by the combinational circuit in a given segment is applied to the input register of the next
segment. For instance, from the block diagram, we can see that register R3 is used as one of the input registers
for the combinational adder circuit.
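The segment-by-segment behavior can be sketched as a small clock-driven simulation. The register names follow the block diagram; this is an illustrative model of the timing, not hardware:

```python
# Cycle-by-cycle sketch of the three-segment multiply-add pipeline
# (R1 <- Ai, R2 <- Bi;  R3 <- R1*R2, R4 <- Ci;  R5 <- R3 + R4).
# None marks an empty segment; results collects successive values of R5.
def multiply_add_pipeline(a, b, c):
    r1 = r2 = r3 = r4 = None
    results = []
    n = len(a)
    for clock in range(n + 2):          # n inputs + 2 extra cycles to drain
        if r3 is not None:              # segment 3: R5 <- R3 + R4
            results.append(r3 + r4)
        if r1 is not None:              # segment 2: R3 <- R1*R2, R4 <- Ci
            r3, r4 = r1 * r2, c[clock - 1]
        else:
            r3 = r4 = None
        if clock < n:                   # segment 1: R1 <- Ai, R2 <- Bi
            r1, r2 = a[clock], b[clock]
        else:
            r1 = r2 = None
    return results

print(multiply_add_pipeline([1, 2, 3], [4, 5, 6], [7, 8, 9]))  # [11, 18, 27]
```

After the pipeline fills (two clock cycles), one result emerges per clock.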
In general, the pipeline organization applies to two areas of computer design:
• Arithmetic Pipeline
• Instruction Pipeline
Instruction Pipeline
Pipeline processing can occur not only in the data stream but in the instruction stream as well.
K ARUN KUMAR –ASSISTANT PROFESSOR –CSE--SVEC Page 4
COMPUTER ORGANIZATION (20BT30501) UNIT-5 II CSE
Most digital computers with complex instructions require an instruction pipeline to carry out operations such as
fetching, decoding, and executing instructions.
In general, the computer needs to process each instruction with the following sequence of steps:
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.
Each step is executed in a particular segment, and there are times when different segments may take different
times to operate on the incoming information. Moreover, there are times when two or more segments may require
memory access at the same time, causing one segment to wait until another is finished with the memory.
The organization of an instruction pipeline will be more efficient if the instruction cycle is divided into segments
of equal duration. One of the most common examples of this type of organization is a Four-segment instruction
pipeline.
A four-segment instruction pipeline combines two or more of these steps into a single segment. For
instance, the decoding of the instruction can be combined with the calculation of the effective address into one
segment.
The following block diagram shows a typical example of a four-segment instruction pipeline. The instruction
cycle is completed in four segments.
Segment 1:
The instruction fetch segment can be implemented using a first-in, first-out (FIFO) buffer.
Segment 2:
The instruction fetched from memory is decoded in the second segment, and eventually, the effective address is
calculated in a separate arithmetic circuit.
Segment 3:
An operand is fetched from memory in the third segment.
Segment 4:
The instructions are finally executed in the last segment of the pipeline organization.
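The overlap of the four segments can be illustrated with a space-time table. The segment names FI (fetch instruction), DA (decode and calculate address), FO (fetch operand), and EX (execute) are the conventional abbreviations, assumed here since the block diagram itself is not reproduced:

```python
# Space-time sketch of the four-segment instruction pipeline.
# Each instruction advances one segment per clock cycle; the table shows
# which instruction (1-based) occupies each segment in each cycle.
SEGMENTS = ["FI", "DA", "FO", "EX"]

def space_time(n_instructions):
    rows = []
    for clock in range(n_instructions + len(SEGMENTS) - 1):
        row = {}
        for s, name in enumerate(SEGMENTS):
            i = clock - s               # instruction in segment s this cycle
            if 0 <= i < n_instructions:
                row[name] = i + 1
        rows.append(row)
    return rows

for clock, row in enumerate(space_time(4), start=1):
    print(f"cycle {clock}: {row}")
```

Four instructions finish in 4 + (4 − 1) = 7 cycles instead of 16, assuming no branch or memory conflicts.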
Vector(Array) Processing
There is a class of computational problems that is beyond the capabilities of a conventional computer.
These problems require a vast number of computations on multiple data items and would take a conventional
computer (with a scalar processor) days or even weeks to complete.
Such complex instructions, which operate on multiple data items at the same time, require a better way of
instruction execution, which is achieved by vector processors.
Scalar CPUs can manipulate only one or two data items at a time, which is not very efficient. Also, simple
instructions like ADD A to B, and store into C, are not practically efficient.
Addresses are used to point to the memory locations where the data to be operated on will be found, which leads
to the added overhead of data lookup. Until the data is found, the CPU sits idle, which is a big
performance issue.
Hence, the concept of the instruction pipeline comes into the picture, in which an instruction passes through
several sub-units in turn. These sub-units perform various independent functions: for example, the first one
decodes the instruction, the second sub-unit fetches the data, and the third sub-unit performs the arithmetic itself.
Therefore, while the data is being fetched for one instruction, the CPU does not sit idle; rather, it works on decoding the
next instruction, ending up working like an assembly line.
A vector processor not only uses an instruction pipeline but also pipelines the data, working on multiple data items at
the same time.
A normal scalar processor instruction would be ADD A, B, which adds two operands. But what
if we could instruct the processor to ADD a group of numbers (from memory locations 0 to n) to
another group of numbers (say, memory locations n to k)? This can be achieved by vector processors.
In a vector processor, a single instruction can ask for multiple data operations, which saves time: the instruction is
decoded once and then keeps operating on different data items.
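The contrast can be sketched as follows. Here `vector_add` is a hypothetical stand-in for a single vector instruction, not a real ISA operation; the point is that the "decode" cost is paid once rather than per element:

```python
# Scalar versus vector execution, modeled in plain Python.
def scalar_add(a, b):
    c = []
    for i in range(len(a)):     # each iteration models one fetched/decoded ADD
        c.append(a[i] + b[i])
    return c

def vector_add(a, b):
    # decoded once; the hardware then streams element pairs through the adder
    return [x + y for x, y in zip(a, b)]

a, b = [1, 2, 3, 4], [10, 20, 30, 40]
print(scalar_add(a, b))   # [11, 22, 33, 44]
print(vector_add(a, b))   # [11, 22, 33, 44]
```

Both produce the same result; the difference lies in how many instruction fetch/decode cycles the hardware spends to get there.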
Typical application areas of vector processing include:
1. Petroleum exploration.
2. Medical diagnosis.
3. Data analysis.
4. Weather forecasting.
5. Aerodynamics and space flight simulations.
6. Image processing.
7. Artificial intelligence.
Multiprocessors
Multiport Memory
• A multiport memory system employs separate buses between each memory module and each CPU.
• The module must have internal control logic to determine which port will have access to memory at any given
time.
• Memory access conflicts are resolved by assigning fixed priorities to each memory port.
• Adv.:
o A high transfer rate can be achieved because of the multiple paths.
• Disadv.:
o It requires expensive memory control logic and a large number of cables and connections.
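The fixed-priority resolution described above can be sketched in a few lines. The port numbering and the rule that the lowest-numbered port wins are illustrative assumptions; real modules implement this in hardware:

```python
# Fixed-priority resolution at one multiport memory module: when several
# CPUs request the module in the same cycle, the lowest-numbered port
# (highest fixed priority) is granted access.
def resolve_port(requests):
    """requests: dict mapping port number -> True if that CPU is requesting."""
    granted = [port for port, req in sorted(requests.items()) if req]
    return granted[0] if granted else None

print(resolve_port({0: False, 1: True, 2: True, 3: False}))  # 1 (port 1 wins)
```

Ports 1 and 2 both request; port 1 wins because its fixed priority is higher, and port 2 must wait for the next cycle.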
Crossbar Switch
• Consists of a number of crosspoints that are placed at intersections between processor buses and memory
module paths.
• The small square at each crosspoint is a switch that determines the path from a processor to a memory module.
• Adv.:
o Supports simultaneous transfers from all memory modules.
• Disadv.:
o The hardware required to implement the switch can become quite large and complex.
• The figure below shows the functional design of a crossbar switch connected to one memory module.
Multistage Switching Network
• Using the 2x2 switch as a building block, it is possible to build a multistage network to control the
communication between a number of sources and destinations.
o To see how this is done, consider the binary tree shown in the figure below.
o Certain request patterns cannot be satisfied simultaneously; e.g., if P1 is connected to one of the destinations 000 through 011, then P2 can be connected only to one of the destinations 100 through 111.
• One such topology is the omega switching network shown in the figure below.
• Some request patterns cannot be connected simultaneously; e.g., two sources cannot be connected
simultaneously to destinations 000 and 001.
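One common formulation of omega-network routing is destination-tag routing, sketched below to check whether a set of (source, destination) requests can be satisfied at once. The exact wiring assumed here (a perfect shuffle before every stage, destination bits consumed MSB first) is one textbook convention, so precisely which patterns block depends on the particular figure:

```python
# Destination-tag routing through an N = 2^n omega network: before each
# of the n stages the n-bit address is perfect-shuffled (rotated left),
# then the 2x2 switch replaces the low bit with the next destination bit
# (MSB first). Two requests block each other if they ever need the same
# output link at the same stage.
def omega_routes(pairs, n):
    links_in_use = set()
    for src, dst in pairs:
        addr = src
        for stage in range(n):
            # perfect shuffle: rotate the n-bit address left by one
            addr = ((addr << 1) | (addr >> (n - 1))) & ((1 << n) - 1)
            # switch output chosen by the destination bit for this stage
            addr = (addr & ~1) | ((dst >> (n - 1 - stage)) & 1)
            link = (stage, addr)
            if link in links_in_use:
                return False        # conflict: this link is already taken
            links_in_use.add(link)
    return True

print(omega_routes([(0b000, 0b000), (0b001, 0b001)], n=3))  # True
print(omega_routes([(0b000, 0b000), (0b010, 0b001)], n=3))  # False (blocked)
```

The second pattern fails because both requests need the same stage-1 link: this is the blocking behavior that distinguishes a multistage network from a full crossbar.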
• In a tightly coupled multiprocessor system, the source is a processor and the destination is a memory module.
• A transfer then proceeds in three steps: set up the path, transfer the address into memory, and transfer the data.
• In a loosely coupled multiprocessor system, both the source and destination are processing elements.
Hypercube System
• The hypercube or binary n-cube multiprocessor structure is a loosely coupled system composed of N = 2^n
processors interconnected in an n-dimensional binary cube.
o Each processor forms a node of the cube; in effect, it contains not only a CPU but also local memory and an
I/O interface.
o Each processor's address differs from that of each of its n neighbors by exactly one bit position.
• The figure below shows the hypercube structure for n = 1, 2, and 3.
• Routing messages through an n-cube structure may take from one to n links from a source node to a
destination node.
o A routing procedure can be developed by computing the exclusive-OR of the source node address
with the destination node address.
o The message is then sent along any one of the axes for which the resulting binary value has a 1 bit;
the 1 bits mark the axes on which the two node addresses differ.
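The exclusive-OR routing procedure can be sketched directly (the order in which the differing bits are crossed is a free choice; this sketch crosses them from the least significant bit upward):

```python
# XOR routing in an n-cube: the route from source to destination crosses,
# one link at a time, each axis (bit position) where the two node
# addresses differ.
def hypercube_route(src, dst, n):
    path = [src]
    diff = src ^ dst                 # 1 bits mark the axes the route must cross
    for bit in range(n):
        if diff & (1 << bit):
            path.append(path[-1] ^ (1 << bit))
    return path

# n = 3: route from node 010 to node 101 (XOR = 111, so three links).
print([format(node, "03b") for node in hypercube_route(0b010, 0b101, 3)])
# ['010', '011', '001', '101']
```

Since each step flips one differing bit, the number of links traversed equals the number of 1 bits in the exclusive-OR, which is at most n.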
• A representative of the hypercube architecture is the Intel iPSC computer complex.
o It consists of 128 (n = 7) microcomputers; each node consists of a CPU, a floating-point processor, local
memory, and serial communication interface units.
• Three organizations have been used in the design of operating systems for multiprocessors: master-
slave configuration, separate operating system, and distributed operating system.
• In the master-slave mode, one processor, designated the master, always executes the operating system functions.
• In the separate operating system organization, each processor can execute the operating system routines it needs.
This organization is more suitable for loosely coupled systems.
• In the distributed operating system organization, the operating system routines are distributed among the
available processors. However, each particular operating system function is assigned to only one processor at a
time. It is also referred to as a floating operating system.
Multicore Computers: Hardware Performance Issues
Increase in Parallelism: The organizational changes in processor design have primarily been focused on increasing
instruction-level parallelism, so that more work can be done in each clock cycle. These changes include, in chronological
order (Figure 18.1):
• Pipelining: Individual instructions are executed through a pipeline of stages so that while one instruction is executing in
one stage of the pipeline, another instruction is executing in another stage of the pipeline.
• Superscalar: Multiple pipelines are constructed by replicating execution resources. This enables parallel execution of
instructions in parallel pipelines, so long as hazards are avoided.
• Simultaneous multithreading (SMT): Register banks are replicated so that multiple threads can share the use of pipeline
resources.
For each of these innovations, designers have over the years attempted to increase the performance of the system by adding
complexity. In the case of pipelining, simple three-stage pipelines were replaced by pipelines with five stages, and then many
more stages, with some implementations having over a dozen stages.
• There is a practical limit to how far this trend can be taken, because with more stages, there is the need for more logic,
more interconnections, and more control signals. With superscalar organization, performance increases can be achieved
by increasing the number of parallel pipelines. Again, there are diminishing returns as the number of pipelines increases.
More logic is required to manage hazards and to stage instruction resources. Eventually, a single thread of execution
reaches the point where hazards and resource dependencies prevent the full use of the multiple pipelines available. This
same point of diminishing returns is reached with SMT, as the complexity of managing multiple threads over a set of
pipelines limits the number of threads and number of pipelines that can be effectively utilized.
• Figure 18.2, from [OLUK05], is instructive in this context. The upper graph shows the exponential increase in Intel
processor performance over the years. The middle graph is calculated by combining Intel's published SPEC CPU
figures and processor clock frequencies to give a measure of the extent to which performance improvement is due to
increased exploitation of instruction-level parallelism.
• There is a flat region in the late 1980s before parallelism was exploited extensively. This is followed by a steep rise as
designers were able to increasingly exploit pipelining, superscalar techniques, and SMT. But, beginning about 2000, a
new flat region of the curve appears, as the limits of effective exploitation of instruction-level parallelism are reached.
• There is a related set of problems dealing with the design and fabrication of the computer chip. The increase in
complexity to deal with all of the logical issues related to very long pipelines, multiple superscalar pipelines, and multiple
SMT register banks means that increasing amounts of the chip area are occupied with coordinating and signal-transfer
logic. This increases the difficulty of designing, fabricating, and debugging the chips. The increasingly difficult
engineering challenge related to processor logic is one of the reasons that an increasing fraction of the processor chip is
devoted to the simpler memory logic. Power issues, discussed next, provide another reason.