
COMPUTER ORGANIZATION (20BT30501) UNIT-5 II CSE

SREE VIDYANIKETHAN ENGINEERING COLLEGE


(Autonomous)
Sree Sainath Nagar, A. Rangampet-517102
COMPUTER SCIENCE AND ENGINEERING

S.No. Name of the Topic

UNIT-V: PIPELINE AND VECTOR PROCESSING, MULTIPROCESSORS, MULTICORE COMPUTERS

1. Pipeline and Vector Processing: Parallel processing
2. Pipelining
3. Instruction pipeline
4. Vector processing
5. Array processors
6. Multiprocessors: Characteristics of multiprocessors
7. Interconnection structures
8. Interprocessor arbitration
9. Multicore Computers: Hardware performance issues
10. Software performance issues
11. Multicore organization
12. Intel Core i7-990X

Pipelining and vector processing


Parallel Processing

Parallel processing is a class of techniques that enables a system to carry out simultaneous data-processing tasks, increasing the computational speed of a computer system.

A parallel processing system can carry out simultaneous data-processing to achieve faster execution time. For
instance, while an instruction is being processed in the ALU component of the CPU, the next instruction can be
read from memory.

The primary purpose of parallel processing is to enhance the computer processing capability and increase its
throughput, i.e. the amount of processing that can be accomplished during a given interval of time.

K ARUN KUMAR –ASSISTANT PROFESSOR –CSE--SVEC Page 1



A parallel processing system can be achieved by having a multiplicity of functional units that perform identical or different operations simultaneously. The data can be distributed among the multiple functional units.

The following diagram shows one possible way of separating the execution unit into eight functional units
operating in parallel.

The operation performed in each functional unit is indicated in each block of the diagram:

o The adder and integer multiplier perform arithmetic operations on integer numbers.
o The floating-point operations are separated into three circuits operating in parallel.
o The logic, shift, and increment operations can be performed concurrently on different data. All units are
independent of each other, so one number can be shifted while another number is being incremented.


Pipelining

The term Pipelining refers to a technique of decomposing a sequential process into sub-operations, with each sub-
operation being executed in a dedicated segment that operates concurrently with all other segments.

The most important characteristic of a pipeline technique is that several computations can be in progress in
distinct segments at the same time. The overlapping of computation is made possible by associating a register
with each segment in the pipeline. The registers provide isolation between each segment so that each can operate
on distinct data simultaneously.

The structure of a pipeline organization can be represented simply by including an input register for each segment
followed by a combinational circuit.

Let us consider an example of combined multiplication and addition operation to get a better understanding of the
pipeline organization.

The combined multiplication and addition operation is done with a stream of numbers such as:

Ai * Bi + Ci for i = 1, 2, 3, ..., 7

The operation to be performed on the numbers is decomposed into sub-operations with each sub-operation to be
implemented in a segment within a pipeline.

The sub-operations performed in each segment of the pipeline are defined as:

R1 ← Ai, R2 ← Bi          Input Ai and Bi
R3 ← R1 * R2, R4 ← Ci     Multiply, and input Ci
R5 ← R3 + R4              Add Ci to the product

The following block diagram represents the combined as well as the sub-operations performed in each segment of
the pipeline.
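The segment-by-segment behavior above can be sketched in Python (an illustrative simulation, not part of the original notes). Each loop iteration models one clock cycle: a new (Ai, Bi, Ci) item enters segment 1 while earlier items are still being multiplied in segment 2 and added in segment 3.

```python
def pipeline_multiply_add(A, B, C):
    """Simulate the 3-segment pipeline: R1<-Ai, R2<-Bi (segment 1);
    R3<-R1*R2, R4<-Ci (segment 2); R5<-R3+R4 (segment 3)."""
    seg1 = seg2 = None            # segment registers latched last cycle
    results = []
    n = len(A)
    for t in range(n + 2):        # n items plus 2 cycles to drain
        if seg2 is not None:      # segment 3: R5 <- R3 + R4
            R3, R4 = seg2
            results.append(R3 + R4)
        if seg1 is not None:      # segment 2: R3 <- R1 * R2, R4 <- Ci
            R1, R2, i = seg1
            seg2 = (R1 * R2, C[i])
        else:
            seg2 = None
        # segment 1: load the next Ai, Bi (index carried along for Ci)
        seg1 = (A[t], B[t], t) if t < n else None
    return results
```

With seven input triples the pipeline delivers all seven results in 7 + 2 = 9 clock cycles, since after the two-cycle fill each cycle produces one result.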


Registers R1, R2, R3, and R4 hold the data on which the combinational circuit of each segment operates.

The output generated by the combinational circuit in a given segment is applied to the input register of the next
segment. For instance, from the block diagram, we can see that the register R3 is used as one of the input registers
for the combinational adder circuit.

In general, the pipeline organization is applicable in two areas of computer design:

• Arithmetic Pipeline
• Instruction Pipeline

Instruction Pipeline

Pipeline processing can occur not only in the data stream but in the instruction stream as well.

Most digital computers with complex instructions require an instruction pipeline to carry out operations such as
fetching, decoding, and executing instructions.

In general, the computer needs to process each instruction with the following sequence of steps.

1. Fetch instruction from memory.


2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.

Each step is executed in a particular segment, and there are times when different segments may take different
times to operate on the incoming information. Moreover, there are times when two or more segments may require
memory access at the same time, causing one segment to wait until another is finished with the memory.

The organization of an instruction pipeline will be more efficient if the instruction cycle is divided into segments
of equal duration. One of the most common examples of this type of organization is a Four-segment instruction
pipeline.

A four-segment instruction pipeline combines two or more of the above steps into single segments. For
instance, the decoding of the instruction can be combined with the calculation of the effective address into one
segment.

The following block diagram shows a typical example of a four-segment instruction pipeline. The instruction
cycle is completed in four segments.
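The overlap among the four segments can be shown with a small space-time table (a hypothetical sketch assuming one clock cycle per segment and no stalls): entry [cycle][segment] records which instruction occupies that segment during that cycle.

```python
def space_time(num_instructions, num_segments=4):
    """Space-time diagram of an ideal pipeline: instruction i (1-based)
    occupies segment s during cycle i + s (0-based), with no stalls."""
    cycles = num_instructions + num_segments - 1
    table = [[None] * num_segments for _ in range(cycles)]
    for i in range(num_instructions):
        for s in range(num_segments):
            table[i + s][s] = i + 1   # instruction numbers start at 1
    return table
```

For example, six instructions complete in 6 + 4 - 1 = 9 cycles instead of the 24 cycles a non-pipelined unit would need, which is where the pipeline speedup comes from.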


Segment 1:

The instruction fetch segment can be implemented using a first-in, first-out (FIFO) buffer.

Segment 2:

The instruction fetched from memory is decoded in the second segment, and eventually, the effective address is
calculated in a separate arithmetic circuit.

Segment 3:

An operand from memory is fetched in the third segment.

Segment 4:

The instructions are finally executed in the last segment of the pipeline organization.


Vector (Array) Processing
There is a class of computational problems that is beyond the capabilities of a conventional computer.
These problems require a vast number of computations on multiple data items that would take a conventional
computer (with a scalar processor) days or even weeks to complete.
Such complex instructions, which operate on multiple data items at the same time, require a better way of
instruction execution, which was achieved by vector processors.
Scalar CPUs can manipulate only one or two data items at a time, which is not very efficient. Also, simple
instructions like ADD A to B, and store into C, are not practically efficient.
Addresses are used to point to the memory location where the data to be operated on will be found, which leads
to the added overhead of data lookup. So until the data is found, the CPU would be sitting idle, which is a big
performance issue.
Hence, the concept of the instruction pipeline comes into the picture, in which an instruction passes through
several sub-units in turn. These sub-units perform various independent functions, for example: the first one
decodes the instruction, the second sub-unit fetches the data, and the third sub-unit performs the arithmetic itself.
Therefore, while the data is fetched for one instruction, the CPU does not sit idle; rather, it works on decoding the
next instruction, ending up working like an assembly line.
A vector processor not only uses an instruction pipeline but also pipelines the data, working on multiple data items at
the same time.
A normal scalar processor instruction would be ADD A, B, which leads to the addition of two operands. But
what if we could instruct the processor to ADD a group of numbers (from memory location 0 to n) to
another group of numbers (say, memory location n to k)? This can be achieved by vector processors.
In a vector processor a single instruction can ask for multiple data operations, which saves time, as the instruction is
decoded once, and then it keeps on operating on different data items.
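The contrast between a scalar loop and a vector instruction can be sketched as follows (purely illustrative Python, not hardware: in real vector hardware the ADD is decoded once and the operands stream through a pipelined adder in chunks of the vector register length, here an assumed vlen of 4).

```python
def scalar_add(A, B):
    """Scalar style: one decoded ADD per element."""
    C = []
    for a, b in zip(A, B):
        C.append(a + b)       # fetch, add, store -- repeated per element
    return C

def vector_add(A, B, vlen=4):
    """Vector style: decode once, then stream vlen-element chunks
    through the (conceptual) pipelined adder."""
    C = []
    for start in range(0, len(A), vlen):
        chunk_a = A[start:start + vlen]
        chunk_b = B[start:start + vlen]
        C.extend(a + b for a, b in zip(chunk_a, chunk_b))
    return C
```

Both produce the same result; the point of the sketch is that the vector version pays the instruction-decode cost once per chunk rather than once per element.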

Applications of Vector Processors


Computers with vector processing capabilities are in demand in specialized applications. The following are some
areas where vector processing is used:

1. Petroleum exploration.
2. Medical diagnosis.
3. Data analysis.
4. Weather forecasting.
5. Aerodynamics and space flight simulations.
6. Image processing.
7. Artificial intelligence.


Multiprocessors

8.1 Characteristics of multiprocessors


• A multiprocessor system is an interconnection of two or more CPUs with memory and input-output
equipment.
• The term “processor” in multiprocessor can mean either a central processing unit(CPU) or an input-
output processor (IOP).
• Multiprocessors are classified as multiple instruction stream, multiple data stream
(MIMD) systems.
• The similarity and distinction between multiprocessor and multicomputer are
o Similarity
▪ Both support concurrent operations
o Distinction
▪ A multicomputer network consists of several autonomous computers that may or may not
communicate with each other.
▪ A multiprocessor system is controlled by one operating system that provides interaction
between processors and all the components of the system cooperate in the solution of a
problem.
• Multiprocessing improves the reliability of the system.
• The benefit derived from a multiprocessor organization is an improved system performance.
o Multiple independent jobs can be made to operate in parallel.
o A single job can be partitioned into multiple parallel tasks.
• Multiprocessing can improve performance by decomposing a program into parallel executable tasks.
o The user can explicitly declare that certain tasks of the program be executed in parallel.
▪ This must be done prior to loading the program by specifying the parallel executable
segments.
o The other is to provide a compiler with multiprocessor software that can automatically detect
parallelism in a user’s program.
• Multiprocessors are classified by the way their memory is organized.
o A multiprocessor system with common shared memory is classified as a
shared-memory or tightly coupled multiprocessor.
▪ It can tolerate a higher degree of interaction between tasks.
o A system in which each processor element has its own private local memory is classified as a distributed-
memory or loosely coupled system.
▪ Such systems are most efficient when the interaction between tasks is minimal.

8.2 Interconnection Structures


• The components that form a multiprocessor system are CPUs, IOPs connected to input-output devices, and a
memory unit.
• The interconnection between the components can have different physical configurations, depending on the
number of transfer paths that are available
o Between the processors and memory in a shared-memory system
o Among the processing elements in a loosely coupled system
• There are several physical forms available for establishing an interconnection network.


o Time-shared common bus


o Multiport memory
o Crossbar switch
o Multistage switching network
o Hypercube system
Time Shared Common Bus
• A common-bus multiprocessor system consists of a number of processors connected through a common
path to a memory unit.
• Disadv.:
o Only one processor can communicate with the memory or another processor at any given time.
o As a consequence, the total overall transfer rate within the system is limited by the speed of the
single path.
• A more economical implementation of a dual bus structure is depicted in Fig. below.
• Part of the local memory may be designed as a cache memory attached to the CPU.

Fig: Time shared common bus organization


Fig: System bus structure for multiprocessors

Multiport Memory
• A multiport memory system employs separate buses between each memory module and each CPU.
• The module must have internal control logic to determine which port will have access to memory at any given
time.
• Memory access conflicts are resolved by assigning fixed priorities to each memory port.
• Adv.:
o A high transfer rate can be achieved because of the multiple paths.
• Disadv.:
o It requires expensive memory control logic and a large number of cables and connections.

Fig: Multiport memory organization
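The fixed-priority conflict resolution described above can be sketched as follows (illustrative only; the port names and priority order are hypothetical, but the rule matches the notes: each module grants access to the highest-priority port requesting it this cycle).

```python
def resolve_port(requests, priority=("CPU1", "CPU2", "CPU3", "CPU4")):
    """Fixed-priority arbitration for one memory module.

    requests: set of port names requesting access this cycle.
    Returns the winning port, or None if no port is requesting."""
    for port in priority:          # scan ports in fixed priority order
        if port in requests:
            return port            # highest-priority requester wins
    return None
```

CPU1 always beats CPU2, and so on, which is cheap to build in hardware but can starve low-priority ports under heavy contention.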


Crossbar Switch
• Consists of a number of crosspoints that are placed at intersections between processor buses and memory
module paths.
• The small square at each crosspoint is a switch that determines the path from a processor to a memory module.
• Adv.:
o Supports simultaneous transfers from all memory modules.
• Disadv.:
o The hardware required to implement the switch can become quite large and complex.
• The figure below shows the functional design of a crossbar switch connected to one memory module.

Fig: Crossbar switch

Fig: Block diagram of crossbar switch
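A crosspoint grant cycle can be sketched as follows (an illustrative model, not from the notes: it assumes each memory module serves one processor per cycle, so simultaneous transfers succeed as long as they target different modules, and conflicting requests simply wait).

```python
def crossbar_schedule(requests):
    """One arbitration cycle of a crossbar switch.

    requests: list of (processor, module) pairs in arrival order.
    Returns the granted pairs; a module already claimed this cycle
    rejects further requesters, who must retry next cycle."""
    granted, busy_modules = [], set()
    for proc, module in requests:
        if module not in busy_modules:
            granted.append((proc, module))   # close this crosspoint
            busy_modules.add(module)
    return granted
```

Note that P1 and P3 below proceed in parallel because they address different modules, which is exactly the simultaneous-transfer advantage listed above.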

Multistage Switching Network


• The basic component of a multistage network is a two-input, two-output interchange switch as shown in
Fig. below.


• Using the 2x2 switch as a building block, it is possible to build a multistage network to control the
communication between a number of sources and destinations.
o To see how this is done, consider the binary tree shown in Fig. below.
o Certain request patterns cannot be satisfied simultaneously, e.g., if P1 is connected to one of
destinations 000-011, then P2 can reach only destinations 100-111.
• One such topology is the omega switching network shown in Fig. below.

Fig: 8 x 8 Omega Switching Network

• Some request patterns cannot be connected simultaneously, e.g., two sources cannot be connected
simultaneously to destinations 000 and 001.
• In a tightly coupled multiprocessor system, the source is a processor and the destination is a memory module.
• A transfer then proceeds in three steps: set up the path, transfer the address into memory, transfer the data.
• In a loosely coupled multiprocessor system, both the source and destination are processing elements.
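The path setup through such a network can be sketched with destination-tag routing (an illustrative model of an 8 x 8 omega network, a standard property of this topology rather than something stated in the notes): at stage k each 2x2 switch examines bit k of the destination address, most significant bit first, and routes to its upper output for a 0 bit or its lower output for a 1 bit.

```python
def omega_route(dest, stages=3):
    """Destination-tag routing in an N x N omega network, N = 2**stages.

    Returns, for each stage, which output ('upper'/'lower') of the
    2x2 interchange switch the message takes to reach `dest`."""
    return ["upper" if (dest >> (stages - 1 - k)) & 1 == 0 else "lower"
            for k in range(stages)]
```

The path depends only on the destination address, not the source, which is why any switch can set itself without global coordination, and also why two messages bound for neighboring destinations can collide inside the network.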


Hypercube System
• The hypercube or binary n-cube multiprocessor structure is a loosely coupled system composed of N = 2^n
processors interconnected in an n-dimensional binary cube.
o Each processor forms a node of the cube; in effect it contains not only a CPU but also local memory and
an I/O interface.
o Each processor address differs from that of each of its n neighbors by exactly one bit position.
• Fig. below shows the hypercube structure for n = 1, 2, and 3.
• Routing messages through an n-cube structure may take from one to n links from a source node to a
destination node.
o A routing procedure can be developed by computing the exclusive-OR of the source node address
with the destination node address.
o The resulting binary value has 1 bits in the positions corresponding to the axes on which the two
nodes differ; the message is then sent along any one of those axes.
• A representative of the hypercube architecture is the Intel iPSC computer complex.
o It consists of 128 (n = 7) microcomputers; each node consists of a CPU, a floating-point processor, local
memory, and serial communication interface units.

Fig: Hypercube structures for n=1,2,3
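The exclusive-OR routing procedure above can be sketched directly (illustrative; traversing the lowest-numbered differing axis first is an arbitrary but valid choice, since the notes allow the message to be sent along any axis where the addresses differ).

```python
def hypercube_route(src, dst, n=3):
    """Route a message through an n-cube from node `src` to node `dst`.

    src ^ dst has 1 bits exactly on the axes where the two node
    addresses differ; the route crosses one such axis per hop."""
    path, node = [src], src
    diff = src ^ dst
    for axis in range(n):
        if diff & (1 << axis):
            node ^= 1 << axis      # flip one address bit = one link
            path.append(node)
    return path
```

The number of hops equals the number of 1 bits in src XOR dst, which is why a message needs between one and n links.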

8.3 Inter processor Communication and Synchronization


• The various processors in a multiprocessor system must be provided with a facility for
communicating with each other.
o A communication path can be established through a portion of memory or a common input-output
channel.
• The sending processor structures a request, a message, or a procedure, and places it in a memory mailbox.
o Status bits reside in common memory.
o The receiving processor can check the mailbox periodically.
o The response time of this procedure can be long.
• A more efficient procedure is for the sending processor to alert the receiving processor directly by means of an
interrupt signal.
• In addition to shared memory, a multiprocessor system may have other shared resources, e.g., a magnetic disk
storage unit.
• To prevent conflicting use of shared resources by several processors, there must be a provision for assigning
resources to processors, i.e., an operating system.
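The polled-mailbox scheme above can be sketched as follows (an illustrative model, not from the notes: one shared message slot with a status bit in common memory; the receiver polls the bit rather than being interrupted).

```python
class Mailbox:
    """Shared-memory mailbox: a sender deposits a message and sets a
    status bit; the receiver polls the bit periodically."""

    def __init__(self):
        self.full = False      # status bit residing in common memory
        self.message = None

    def send(self, msg):
        self.message = msg
        self.full = True       # signal that the mailbox holds a message

    def poll(self):
        """Receiver's periodic check; returns the message or None."""
        if self.full:
            msg, self.full = self.message, False
            return msg
        return None
```

The latency of this scheme is bounded by the polling interval, which is exactly why the interrupt-driven alternative described above is more efficient.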


• There are three organizations that have been used in the design of operating system for multiprocessors: master-
slave configuration, separate operating system, and distributed operating system.
• In a master-slave mode, one processor, the master, always executes the operating system functions.
• In the separate operating system organization, each processor can execute the operating system routines it needs.
This organization is more suitable for loosely coupled systems.
• In the distributed operating system organization, the operating system routines are distributed among the
available processors. However, each particular operating system function is assigned to only one processor at a
time. It is also referred to as a floating operating system.

Loosely Coupled System


• There is no shared memory for passing information.
• The communication between processors is by means of message passing through I/O channels.
• The communication is initiated by one processor calling a procedure that resides in the memory of the
processor with which it wishes to communicate.
• The communication efficiency of the interprocessor network depends on the communication routing
protocol, processor speed, data link speed, and the topology of the network.

Hardware Performance Issues

Increase in Parallelism: The organizational changes in processor design have primarily been focused on increasing
instruction-level parallelism, so that more work could be done in each clock cycle. These changes include, in chronological
order (Figure 18.1):

• Pipelining: Individual instructions are executed through a pipeline of stages so that while one instruction is executing in
one stage of the pipeline, another instruction is executing in another stage of the pipeline.


• Superscalar: Multiple pipelines are constructed by replicating execution resources. This enables parallel execution of
instructions in parallel pipelines, so long as hazards are avoided.
• Simultaneous multithreading (SMT): Register banks are replicated so that multiple threads can share the use of pipeline
resources.

For each of these innovations, designers have over the years attempted to increase the performance of the system by adding
complexity. In the case of pipelining, simple three-stage pipelines were replaced by pipelines with five stages, and then many
more stages, with some implementations having over a dozen stages.


• There is a practical limit to how far this trend can be taken, because with more stages, there is the need for more logic,
more interconnections, and more control signals. With superscalar organization, performance increases can be achieved
by increasing the number of parallel pipelines. Again, there are diminishing returns as the number of pipelines increases.
More logic is required to manage hazards and to stage instruction resources. Eventually, a single thread of execution
reaches the point where hazards and resource dependencies prevent the full use of the multiple pipelines available. This
same point of diminishing returns is reached with SMT, as the complexity of managing multiple threads over a set of
pipelines limits the number of threads and number of pipelines that can be effectively utilized.
• Figure 18.2, from [OLUK05], is instructive in this context. The upper graph shows the exponential increase in Intel
processor performance over the years. The middle graph is calculated by combining Intel's published SPEC CPU
figures and processor clock frequencies to give a measure of the extent to which performance improvement is due to
increased exploitation of instruction-level parallelism.
• There is a flat region in the late 1980s before parallelism was exploited extensively. This is followed by a steep rise as
designers were able to increasingly exploit pipelining, superscalar techniques, and SMT. But, beginning about 2000, a
new flat region of the curve appears, as the limits of effective exploitation of instruction-level parallelism are reached.
• There is a related set of problems dealing with the design and fabrication of the computer chip. The increase in
complexity to deal with all of the logical issues related to very long pipelines, multiple superscalar pipelines, and multiple
SMT register banks means that an increasing amount of the chip area is occupied with coordination and signal-transfer
logic. This increases the difficulty of designing, fabricating, and debugging the chips. The increasingly difficult
engineering challenge related to processor logic is one of the reasons that an increasing fraction of the processor chip is
devoted to the simpler memory logic. Power issues, discussed next, provide another reason.

