
TERM PAPER

“PIPELINE ARCHITECTURE”

Submitted to:
Mansi Rana Lect. (LPU)

Submitted By:
Amit Kumar Singh MCA (D1105) RD1105-A03 Reg No. 11100823


Introduction
The Pipeline Architecture (PARC) project is designed for use in batch applications whose primary responsibility is translation or conversion of data between or within systems.

Pipeline Architecture
This section describes the basic pipeline architecture, along with two types of improvements: superpipelines and superscalar pipelines. (Pipelining and multiple issuing are not defined by the ISA, but are implementation dependent.) MIPS processors all use some variation of a pipeline in their architecture. A pipeline is divided into the following discrete stages:
• Fetch
• Arithmetic operation
• Memory access
• Write back
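To make the overlap between stages concrete, the following C sketch simulates instructions flowing through these four stages, one stage advancing per clock cycle; the stage names and the fixed instruction count are illustrative simplifications and are not taken from any particular MIPS implementation.

    /* Illustrative 4-stage pipeline simulation (hypothetical, simplified):
       each clock cycle, every stage works on a different instruction,
       so up to four instructions are in flight at once. */
    #include <stdio.h>

    enum { FETCH, EXECUTE, MEMORY, WRITEBACK, STAGES };
    static const char *stage_name[STAGES] =
        { "Fetch", "Arithmetic", "Memory access", "Write back" };

    int main(void) {
        const int num_instructions = 6;
        int stage[STAGES];                     /* which instruction each stage holds */
        for (int s = 0; s < STAGES; s++) stage[s] = -1;   /* -1 = stage is empty */

        /* Run until the last instruction drains out of the write-back stage. */
        for (int cycle = 1; stage[WRITEBACK] != num_instructions - 1; cycle++) {
            /* Advance the pipeline: later stages take over from earlier ones. */
            for (int s = STAGES - 1; s > 0; s--) stage[s] = stage[s - 1];
            /* Fetch the next instruction, if any remain. */
            stage[FETCH] = (cycle <= num_instructions) ? cycle - 1 : -1;

            printf("cycle %2d:", cycle);
            for (int s = 0; s < STAGES; s++)
                if (stage[s] >= 0) printf("  %s=I%d", stage_name[s], stage[s]);
            printf("\n");
        }
        return 0;
    }

Once the pipeline is full, one instruction completes every cycle, even though each individual instruction still takes four cycles to pass through all the stages.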

What is Parallel Computing?

Traditionally, software has been written for serial computation:
o To be run on a single computer having a single Central Processing Unit (CPU)
o A problem is broken into a discrete series of instructions
o Instructions are executed one after another
o Only one instruction may execute at any moment in time


In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
o To be run using multiple CPUs
o A problem is broken into discrete parts that can be solved concurrently
o Each part is further broken down to a series of instructions
o Instructions from each part execute simultaneously on different CPUs

• The compute resources might be:
  o A single computer with multiple processors
  o An arbitrary number of computers connected by a network
  o A combination of both
• The computational problem should be able to:
  o Be broken apart into discrete pieces of work that can be solved simultaneously
  o Execute multiple program instructions at any moment in time
  o Be solved in less time with multiple compute resources than with a single compute resource

The Universe is Parallel:
• Parallel computing is an evolution of serial computing that attempts to emulate what has always been the state of affairs in the natural world: many complex, interrelated events happening at the same time, yet within a sequence. For example:
  o Galaxy formation
  o Planetary movement
  o Weather and ocean patterns
  o Tectonic plate drift
  o Rush hour traffic
  o Automobile assembly line
  o Building a jet
  o Ordering a hamburger at the drive through

von Neumann Architecture
• Named after the Hungarian mathematician John von Neumann, who first authored the general requirements for an electronic computer in his 1945 papers. Since then, virtually all computers have followed this basic design, differing from earlier computers which were programmed through "hard wiring".
  o Comprised of four main components: Memory, Control Unit, Arithmetic Logic Unit, Input/Output
  o Read/write, random access memory is used to store both program instructions and data
    - Program instructions are coded data which tell the computer to do something
    - Data is simply information to be used by the program
  o Control unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task
  o Arithmetic Unit performs basic arithmetic operations
  o Input/Output is the interface to the human operator
• So what? Who cares? Well, parallel computers still follow this basic design, just multiplied in units. The basic, fundamental architecture remains the same.

Concepts and Terminology
Flynn's Classical Taxonomy

• There are different ways to classify parallel computers. One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy.
• Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of Instruction and Data. Each of these dimensions can have only one of two possible states: Single or Multiple.
• The 4 possible classifications according to Flynn are:
  o SISD: Single Instruction, Single Data
  o SIMD: Single Instruction, Multiple Data
  o MISD: Multiple Instruction, Single Data
  o MIMD: Multiple Instruction, Multiple Data

Single Instruction, Single Data (SISD):
• A serial (non-parallel) computer
• Single Instruction: Only one instruction stream is being acted on by the CPU during any one clock cycle
• Single Data: Only one data stream is being used as input during any one clock cycle
• Deterministic execution
• This is the oldest and, even today, the most common type of computer
• Examples: older generation mainframes, minicomputers and workstations; most modern day PCs

Single Instruction, Multiple Data (SIMD):
• A type of parallel computer
• Single Instruction: All processing units execute the same instruction at any given clock cycle
• Multiple Data: Each processing unit can operate on a different data element
• Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing
• Synchronous (lockstep) and deterministic execution
• Two varieties: Processor Arrays and Vector Pipelines
• Examples:
  o Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
  o Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10
• Most modern computers, particularly those with graphics processor units (GPUs), employ SIMD instructions and execution units, as illustrated in the sketch below.
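As a rough, assumed illustration of the SIMD idea on a current CPU, the loop below applies the same arithmetic to every element of an array; an auto-vectorizing compiler (for example, gcc with -O3) may translate it into SIMD instructions, each of which operates on several data elements at once.

    /* A data-parallel loop: the same multiply-add is applied to every element.
       With vectorization enabled, one SIMD instruction can process several
       of these elements in a single operation. */
    #include <stdio.h>

    #define N 1024

    int main(void) {
        float x[N], y[N];
        for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }

        for (int i = 0; i < N; i++)      /* candidate for SIMD vectorization */
            y[i] = 2.0f * x[i] + y[i];

        printf("y[10] = %f\n", y[10]);   /* expect 21.0 */
        return 0;
    }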

Some General Parallel Terminology
Like everything else, parallel computing has its own "jargon". Some of the more commonly used terms associated with parallel computing are listed below. Most of these will be discussed in more detail later.

Supercomputing / High Performance Computing (HPC)
Using the world's fastest and largest computers to solve large problems.

Node
A standalone "computer in a box". Usually comprised of multiple CPUs/processors/cores. Nodes are networked together to comprise a supercomputer.

CPU / Socket / Processor / Core
This varies, depending upon who you talk to. In the past, a CPU (Central Processing Unit) was a singular execution component for a computer. Then, multiple CPUs were incorporated into a node. Then, individual CPUs were subdivided into multiple "cores", each being a unique execution unit. The result is a node with multiple CPUs, each containing multiple cores. CPUs with multiple cores are sometimes called "sockets" - vendor dependent. The nomenclature is confused at times. Wonder why?

Distributed Memory
General Characteristics:
• Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of global address space across all processors.

• Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply.
• When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.
• The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.

Advantages:
• Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately.
• Each processor can rapidly access its own memory without interference and without the overhead incurred with trying to maintain cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and networking.

Disadvantages:
• The programmer is responsible for many of the details associated with data communication between processors.
• It may be difficult to map existing data structures, based on global memory, to this memory organization.
• Non-uniform memory access (NUMA) times.

Parallel Programming Models
Overview
• There are several parallel programming models in common use:
  o Shared Memory (without threads)
  o Threads
  o Distributed Memory / Message Passing
  o Data Parallel
  o Hybrid
  o Single Program Multiple Data (SPMD)
  o Multiple Program Multiple Data (MPMD)
• Parallel programming models exist as an abstraction above hardware and memory architectures.
• Although it might not seem apparent, these models are NOT specific to a particular type of machine or memory architecture. In fact, any of these models can (theoretically) be implemented on any underlying hardware. Two examples from the past are discussed below.
  o SHARED memory model on a DISTRIBUTED memory machine: the Kendall Square Research (KSR) ALLCACHE approach. Machine memory was physically distributed across networked machines, but appeared to the user as a single shared memory (global address space). Generically, this approach is referred to as "virtual shared memory".
  o DISTRIBUTED memory model on a SHARED memory machine: Message Passing Interface (MPI) on the SGI Origin 2000. The SGI Origin 2000 employed the CC-NUMA type of shared memory architecture, where every task has direct access to global address space spread across all machines. However, the ability to send and receive messages using MPI, as is commonly done over a network of distributed memory machines, was implemented and commonly used.
• Which model to use? This is often a combination of what is available and personal choice. There is no "best" model, although there certainly are better implementations of some models over others.
• The following sections describe each of the models mentioned above, and also discuss some of their actual implementations.

Parallel Programming Models
Shared Memory Model (without threads)
• In this programming model, tasks share a common address space, which they read and write to asynchronously.
• Various mechanisms such as locks / semaphores may be used to control access to the shared memory, as in the sketch at the end of this section.
• An advantage of this model from the programmer's point of view is that the notion of data "ownership" is lacking, so there is no need to specify explicitly the communication of data between tasks. Program development can often be simplified.
• An important disadvantage in terms of performance is that it becomes more difficult to understand and manage data locality.
  o Keeping data local to the processor that works on it conserves memory accesses, cache refreshes and bus traffic that occurs when multiple processors use the same data.
  o Unfortunately, controlling data locality is hard to understand and beyond the control of the average user.

Implementations:
• On stand-alone SMP machines, native compilers and/or hardware translate user program variables into actual memory addresses, which are global.
• On distributed shared memory machines, such as the SGI Origin, memory is physically distributed across a network of machines, but made global through specialized hardware and software.
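Here is a minimal, POSIX-specific C sketch of this model, assuming a Unix-like system: two processes (not threads) read and write a common, anonymously mapped address space, and a process-shared semaphore plays the role of the lock; error handling is omitted for brevity.

    /* Shared memory model sketch: two processes update data in a common
       address space asynchronously; a process-shared POSIX semaphore
       controls access.  Compile with: cc -pthread */
    #include <semaphore.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    struct shared {
        sem_t lock;     /* controls access to the shared counter */
        long  counter;  /* data visible to both processes */
    };

    int main(void) {
        /* Map an anonymous region that parent and child will share. */
        struct shared *shm = mmap(NULL, sizeof *shm, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        sem_init(&shm->lock, 1 /* shared between processes */, 1);
        shm->counter = 0;

        if (fork() == 0) {                 /* child task */
            for (int i = 0; i < 100000; i++) {
                sem_wait(&shm->lock);      /* acquire the lock */
                shm->counter++;            /* update shared data */
                sem_post(&shm->lock);      /* release the lock */
            }
            _exit(0);
        }
        for (int i = 0; i < 100000; i++) { /* parent task does the same work */
            sem_wait(&shm->lock);
            shm->counter++;
            sem_post(&shm->lock);
        }
        wait(NULL);
        printf("counter = %ld\n", shm->counter);  /* expect 200000 */
        munmap(shm, sizeof *shm);
        return 0;
    }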

Parallel Programming Models
Threads Model
• This programming model is a type of shared memory programming. In the threads model of parallel programming, a single process can have multiple, concurrent execution paths.
• Perhaps the most simple analogy that can be used to describe threads is the concept of a single program that includes a number of subroutines:
  o The main program a.out is scheduled to run by the native operating system. a.out loads and acquires all of the necessary system and user resources to run.
  o a.out performs some serial work, and then creates a number of tasks (threads) that can be scheduled and run by the operating system concurrently.
  o Each thread has local data, but also shares the entire resources of a.out. This saves the overhead associated with replicating a program's resources for each thread. Each thread also benefits from a global memory view because it shares the memory space of a.out.
  o A thread's work may best be described as a subroutine within the main program. Any thread can execute any subroutine at the same time as other threads.
  o Threads communicate with each other through global memory (updating address locations). This requires synchronization constructs to ensure that more than one thread is not updating the same global address at any time.
  o Threads can come and go, but a.out remains present to provide the necessary shared resources until the application has completed.

Implementations:
• From a programming perspective, threads implementations commonly comprise:
  o A library of subroutines that are called from within parallel source code
  o A set of compiler directives imbedded in either serial or parallel source code

In both cases, the programmer is responsible for determining all parallelism.
• Threaded implementations are not new in computing. Historically, hardware vendors have implemented their own proprietary versions of threads. These implementations differed substantially from each other, making it difficult for programmers to develop portable threaded applications.
• Unrelated standardization efforts have resulted in two very different implementations of threads: POSIX Threads and OpenMP.

POSIX Threads
o Library based; requires parallel coding (see the sketch below)
o Specified by the IEEE POSIX 1003.1c standard (1995)
o C Language only
o Commonly referred to as Pthreads
o Most hardware vendors now offer Pthreads in addition to their proprietary threads implementations
o Very explicit parallelism; requires significant programmer attention to detail

OpenMP
o Compiler directive based; can use serial code
o Jointly defined and endorsed by a group of major computer hardware and software vendors. The OpenMP Fortran API was released October 28, 1997. The C/C++ API was released in late 1998.
o Portable / multi-platform, including Unix and Windows NT platforms
o Available in C/C++ and Fortran implementations
o Can be very easy and simple to use - provides for "incremental parallelism"

Microsoft has its own implementation for threads, which is not related to the UNIX POSIX standard or OpenMP.

More Information:
• POSIX Threads tutorial: computing.llnl.gov/tutorials/pthreads
• OpenMP tutorial: computing.llnl.gov/tutorials/openMP
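A minimal Pthreads sketch of the library-based approach, assuming a POSIX system: several threads update a counter in global memory, with a mutex as the synchronization construct; the thread count and loop bound are arbitrary choices for illustration.

    /* Pthreads sketch: a library of subroutines called from the source code.
       Four threads increment a shared (global-memory) counter; a mutex keeps
       the updates from colliding.  Compile with: cc -pthread */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long counter = 0;                                /* shared global data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* only one thread updates at a time */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);    /* spawn threads */
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);                     /* wait for completion */
        printf("counter = %ld\n", counter);                 /* expect 400000 */
        return 0;
    }

An OpenMP version of the same loop would replace the explicit create/join calls with a single compiler directive, which is the "incremental parallelism" referred to above.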

Parallel Programming Models
Distributed Memory / Message Passing Model
• This model demonstrates the following characteristics:
  o A set of tasks that use their own local memory during computation. Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines.
  o Tasks exchange data through communications by sending and receiving messages.
  o Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation (illustrated in the sketch below).

Implementations:
• From a programming perspective, message passing implementations usually comprise a library of subroutines. Calls to these subroutines are imbedded in source code. The programmer is responsible for determining all parallelism.
• Historically, a variety of message passing libraries have been available since the 1980s. These implementations differed substantially from each other, making it difficult for programmers to develop portable applications.
• In 1992, the MPI Forum was formed with the primary goal of establishing a standard interface for message passing implementations.
• Part 1 of the Message Passing Interface (MPI) was released in 1994. Part 2 (MPI-2) was released in 1996. Both MPI specifications are available on the web at http://www-unix.mcs.anl.gov/mpi/.
• MPI is now the "de facto" industry standard for message passing, replacing virtually all other message passing implementations used for production work. MPI implementations exist for virtually all popular parallel computing platforms. Not all implementations include everything in both MPI-1 and MPI-2.

More Information:
• MPI tutorial: computing.llnl.gov/tutorials/mpi
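The matching send/receive requirement can be illustrated with a minimal MPI sketch in C; it assumes an MPI installation and at least two processes (for example, launched with mpirun -np 2), and is not taken from the paper itself.

    /* Message passing sketch: each task has its own local memory, and data
       moves only through explicit, cooperating send/receive calls.
       Typical build/run: mpicc send_recv.c && mpirun -np 2 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                       /* data in task 0's local memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* The receive must match the send: same count, type and tag. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("task 1 received %d from task 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }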

Parallel Programming Models
Data Parallel Model
• The data parallel model demonstrates the following characteristics:
  o Most of the parallel work focuses on performing operations on a data set. The data set is typically organized into a common structure, such as an array or cube.
  o A set of tasks work collectively on the same data structure; however, each task works on a different partition of the same data structure.
  o Tasks perform the same operation on their partition of work, for example, "add 4 to every array element" (see the sketch below).
• On shared memory architectures, all tasks may have access to the data structure through global memory. On distributed memory architectures, the data structure is split up and resides as "chunks" in the local memory of each task.
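A minimal sketch of the "add 4 to every array element" example, assuming OpenMP on a shared memory machine: the directive partitions the loop iterations among threads, so each thread performs the same operation on its own partition of the array.

    /* Data parallel sketch: every task applies the same operation ("add 4")
       to its own partition of a shared array.  Compile with: cc -fopenmp */
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static int a[N];                  /* the common data structure */

        #pragma omp parallel for          /* iterations are divided among threads */
        for (int i = 0; i < N; i++)
            a[i] += 4;

        printf("a[0] = %d, a[N-1] = %d\n", a[0], a[N - 1]);   /* both 4 */
        return 0;
    }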

Parallel Programming Models
SPMD and MPMD

Single Program Multiple Data (SPMD):
• SPMD is actually a "high level" programming model that can be built upon any combination of the previously mentioned parallel programming models.
• SINGLE PROGRAM: All tasks execute their copy of the same program simultaneously. This program can be threads, message passing, data parallel or hybrid.
• MULTIPLE DATA: All tasks may use different data.
• SPMD programs usually have the necessary logic programmed into them to allow different tasks to branch or conditionally execute only those parts of the program they are designed to execute, as sketched below. That is, tasks do not necessarily have to execute the entire program - perhaps only a portion of it.
• The SPMD model, using message passing or hybrid programming, is probably the most commonly used parallel programming model for multi-node clusters.

Multiple Program Multiple Data (MPMD):
• Like SPMD, MPMD is actually a "high level" programming model that can be built upon any combination of the previously mentioned parallel programming models.
• MULTIPLE PROGRAM: Tasks may execute different programs simultaneously. The programs can be threads, message passing, data parallel or hybrid.
• MULTIPLE DATA: All tasks may use different data.
• MPMD applications are not as common as SPMD applications, but may be better suited for certain types of problems, particularly those that lend themselves better to functional decomposition than domain decomposition (discussed later under Partitioning).
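The branching logic typical of SPMD programs can be sketched as follows, here assuming MPI as the underlying model: every task runs the same program, but each task's rank decides which portion of the program it actually executes.

    /* SPMD sketch: one program run as many tasks; each task branches on its
       own rank and executes only the part of the program meant for it. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* Only task 0 runs this branch, e.g. coordination or I/O. */
            printf("task 0 of %d: distributing work\n", size);
        } else {
            /* All other tasks run the worker branch on their own data. */
            printf("task %d of %d: working on partition %d\n", rank, size, rank);
        }

        MPI_Finalize();
        return 0;
    }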

Designing Parallel Programs
Automatic vs. Manual Parallelization
• Designing and developing parallel programs has characteristically been a very manual process. The programmer is typically responsible for both identifying and actually implementing parallelism.
• Very often, manually developing parallel codes is a time consuming, complex, error-prone and iterative process.
• For a number of years now, various tools have been available to assist the programmer with converting serial programs into parallel programs. The most common type of tool used to automatically parallelize a serial program is a parallelizing compiler or pre-processor.
• A parallelizing compiler generally works in two different ways:
  o Fully Automatic: The compiler analyzes the source code and identifies opportunities for parallelism. The analysis includes identifying inhibitors to parallelism and possibly a cost weighting on whether or not the parallelism would actually improve performance. Loops (do, for) are the most frequent target for automatic parallelization.
  o Programmer Directed: Using "compiler directives" or possibly compiler flags, the programmer explicitly tells the compiler how to parallelize the code. May be able to be used in conjunction with some degree of automatic parallelization also (see the sketch at the end of this subsection).
• If you are beginning with an existing serial code and have time or budget constraints, then automatic parallelization may be the answer. However, there are several important caveats that apply to automatic parallelization:
  o Wrong results may be produced
  o Performance may actually degrade
  o Much less flexible than manual parallelization
  o Limited to a subset (mostly loops) of code
  o May actually not parallelize code if the analysis suggests there are inhibitors or the code is too complex
• The remainder of this section applies to the manual method of developing parallel codes.
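To contrast the two approaches, the sketch below parallelizes a loop with an explicit OpenMP directive (programmer directed); the comment notes a compiler flag for the fully automatic route. Both the directive and the GCC-specific flag are assumptions about the toolchain rather than anything prescribed here.

    /* Programmer-directed parallelization: the directive tells the compiler
       exactly which loop to split across threads (compile with: gcc -fopenmp). */
    #include <stdio.h>

    #define N 1000000
    double a[N], b[N];

    int main(void) {
        #pragma omp parallel for          /* OpenMP compiler directive */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i] + 1.0;      /* independent iterations */

        /* Fully automatic alternative (no directive in the source): some
           compilers, e.g. GCC with -ftree-parallelize-loops=4, analyze plain
           loops like the one above and parallelize them themselves - but only
           if their analysis finds no inhibitors. */
        printf("a[0] = %f\n", a[0]);
        return 0;
    }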

Designing Parallel Programs
Partitioning
• One of the first steps in designing a parallel program is to break the problem into discrete "chunks" of work that can be distributed to multiple tasks. This is known as decomposition or partitioning.
• There are two basic ways to partition computational work among parallel tasks: domain decomposition and functional decomposition.

Domain Decomposition:
• In this type of partitioning, the data associated with a problem is decomposed. Each parallel task then works on a portion of the data.

Functional Decomposition:
• In this approach, the focus is on the computation that is to be performed rather than on the data manipulated by the computation. The problem is decomposed according to the work that must be done. Each task then performs a portion of the overall work.

• Functional decomposition lends itself well to problems that can be split into different tasks. For example:

Ecosystem Modeling
Each program calculates the population of a given group, where each group's growth depends on that of its neighbors. As time progresses, each process calculates its current state, then exchanges information with the neighbor populations. All tasks then progress to calculate the state at the next time step.

Signal Processing
An audio signal data set is passed through four distinct computational filters. Each filter is a separate process. The first segment of data must pass through the first filter before progressing to the second. When it does, the second segment of data passes through the first filter. By the time the fourth segment of data is in the first filter, all four tasks are busy.

Designing Parallel Programs
Array Processing
• This example demonstrates calculations on 2-dimensional array elements, with the computation on each array element being independent from other array elements.
• The serial program calculates one element at a time in sequential order. Serial code could be of the form:

    do j = 1, n
      do i = 1, n
        a(i,j) = fcn(i,j)
      end do
    end do

• The calculation of elements is independent of one another - this leads to an embarrassingly parallel situation.
• The problem should be computationally intensive.

Array Processing - Parallel Solution 1
• Array elements are distributed so that each processor owns a portion of the array (a subarray).
• Independent calculation of array elements ensures there is no need for communication between tasks.
• Distribution scheme is chosen by other criteria, e.g. unit stride (stride of 1) through the subarrays. Unit stride maximizes cache/memory usage.
• Since it is desirable to have unit stride through the subarrays, the choice of a distribution scheme depends on the programming language. See the Block - Cyclic Distributions diagram for the options.

• After the array is distributed, each task executes the portion of the loop corresponding to the data it owns. For example, with Fortran block distribution:

    do j = mystart, myend
      do i = 1, n
        a(i,j) = fcn(i,j)
      end do
    end do

• Notice that only the outer loop variables are different from the serial solution.

Summary
Parallel computing is a mainstay of modern computation and information analysis and management, ranging from scientific computing to information and data services. The challenge of programming parallel systems has been highlighted as one of the three greatest challenges for the computer industry by leaders of even the largest desktop companies. The inevitable and rapidly growing adoption of multi-core parallel architectures within a processor chip by all of the computer industry pushes explicit parallelism to the forefront of computing for all applications and scales, and makes the challenge of parallel programming and system understanding all the more crucial.

This course caters to students from all departments who are interested in using parallel computers of various scales to speed up the solution of problems. It also caters to computer science and engineering students who want to understand and grapple with the key issues in the design of parallel architectures and software systems. The first two thirds of the course will focus on the key issues in parallel programming and architecture. A significant theme of the treatment of systems issues is their implications for application software, drawing the connection between the two and thereby making the systems issues relevant to users of such systems as well. In the last third, there will be a significant focus on the modern trend toward increasingly more parallel multicore processors within a single chip. In addition to general programming and systems, we will examine some advanced topics, ranging from methods to tolerate latency to programming models for clustered commodity systems to new classes of information applications and services that strongly leverage large-scale parallel systems. Students will do a parallel programming project, either with an application they propose from their area of interest or with one that is suggested by the instructors. Students will have dedicated access to two kinds of multi-core processor systems in addition to large-scale multiprocessors for their projects.