
TERM PAPER ON COMPUTER ORGANIZATION AND ARCHITECTURE

TOPIC: PROCESSORS WITH PARALLEL ARCHITECTURE


SUBMITTED TO: Mr. Kiran Kumar Kaki
SUBMITTED BY: Nancy Goyal, ROLL NO: B44, REG NO: 11011419, SECTION: K2003

ACKNOWLEDGEMENT
I take this opportunity to express my gratitude towards my teacher and guide Mr. Kiran Kumar Kaki, who has helped me throughout this term paper. He has guided me and helped me clear up all my problems and doubts regarding my topic. I am indebted to my seniors and friends who have assisted me and lent a helping hand in every aspect of this term paper. They have helped me in studying this topic and hence in preparing this term paper. I am thankful to all of them.

NANCY GOYAL

CONTENTS

INTRODUCTION
CLASSIFICATION OF PARALLEL ARCHITECTURE
NEED OF PARALLEL PROCESSING
TYPES OF PARALLELISM
    BIT-LEVEL PARALLELISM
    INSTRUCTION-LEVEL PARALLELISM
    DATA PARALLELISM
    TASK PARALLELISM
HARDWARE
    SISD
    SIMD
    MISD
    MIMD
SHARED MEMORY ORGANISATION
MESSAGE PASSING ORGANISATION
CLASSES OF PARALLEL COMPUTERS
    MULTI-CORE COMPUTING
    SYMMETRIC MULTIPROCESSORS
    DISTRIBUTED COMPUTING
    CLUSTER COMPUTING
    MASSIVELY PARALLEL PROCESSING
    GRID COMPUTING
INTERCONNECTION NETWORKS
    MODE OF OPERATION
    CONTROL STRATEGY
    SWITCHING TECHNIQUES
    TOPOLOGY
SPECIALIZED PARALLEL COMPUTERS
    RECONFIGURABLE COMPUTING WITH FIELD-PROGRAMMABLE GATE ARRAYS
    GENERAL-PURPOSE COMPUTING ON GRAPHICS PROCESSING UNITS
    VECTOR PROCESSORS
APPLICATIONS OF PARALLEL PROCESSING
FUTURE OF PARALLEL PROCESSING
REFERENCES

ABSTRACT
This term paper covers the very basics of parallel processing. It begins with a brief overview, including concepts and terminology associated with parallel computing. Then the main topics of parallel processing are explored, including the need for parallel processing, types of parallelism, hardware, applications and future scope. In the computational field, the technique of solving computational tasks by using multiple resources of different types simultaneously is called parallel processing. It breaks a large problem down into smaller ones, which are solved concurrently. It saves time and money, as the parallel use of more resources shortens task completion time with potential cost savings, and parallel clusters can be constructed from cheap components. Many applications require more computing power than a traditional sequential computer can offer. Parallel processing provides a cost-effective solution to this problem by increasing the number of CPUs in a computer and by adding efficient communication between them.

INTRODUCTION
Computer architects have always strived to increase the performance of their computer architectures. High performance may come from fast dense circuitry, packaging technology, and parallelism. Parallel processors are computer systems consisting of multiple processing units connected via some interconnection network, plus the software needed to make the processing units work together. Processing of multiple tasks simultaneously on multiple processors is called PARALLEL PROCESSING. A parallel program consists of multiple active processes simultaneously solving a given problem. A given task is divided into multiple subtasks using a divide and conquer technique, and each of them is processed on a different CPU. Programming on a multiprocessor system using the divide and conquer technique is called parallel programming. Parallelism has been employed for many years, mainly in high-performance computing, but interest in it has grown lately. Parallel computers can be roughly classified according to the level at which the hardware supports parallelism, with multi-core and multi-processor computers having multiple processing elements within a single machine, while clusters, MPPs, and grids use multiple computers to work on the same task. Specialized parallel computer architectures are sometimes used alongside traditional processors for accelerating specific tasks.

CLASSIFICATION OF PARALLEL ARCHITECTURE


Parallel architecture can be classified into two categories:
1. Data-parallel architecture
2. Function-parallel architecture

Data-parallel architecture is further classified into four categories:
1. Vector architecture
2. Associative and neural architecture
3. SIMDs
4. Systolic architecture

Function-parallel architecture is further classified into three categories:
1. Instruction-level parallel architecture (ILPs)
2. Thread-level parallel architecture
3. Process-level parallel architecture (MIMDs)

Instruction-level parallel architecture is sub-divided into three categories:
Pipelined processors
VLIWs
Superscalar processors

Process-level parallel architecture is sub-divided into two categories:
Distributed memory MIMD
Shared memory MIMD

NEED OF PARALLEL PROCESSING

The development of parallel processing is being influenced by many factors:
1. Hardware improvements like pipelining and superscalar execution are not scaling well and require sophisticated compiler technology; developing such compiler technology is a difficult task.
2. Sequential architectures are reaching physical limitations, as they are constrained by the speed of light, and hence an alternative way to obtain high computational speed is to connect multiple CPUs.
3. Computational requirements are ever increasing, both in scientific and business computing. The technical computing problems which require high-speed computational power are related to life sciences, aerospace, geographical information systems, mechanical design, etc.
4. Significant development in networking technology is paving the way for network-based, cost-effective parallel computing.
5. Parallel processing technology is now mature and is being exploited commercially. All computers (including desktops and laptops) are now based on parallel processing (e.g., multi-core) architectures.

Parallel processing is the processing of program instructions by dividing them among multiple processors with the objective of running a program in less time. In the earliest computers, only one program ran at a time. A computation-intensive program that took one hour to run and a tape-copying program that took one hour to run would take a total of two hours to run. An early form of parallel processing allowed the interleaved execution of both programs together. The computer would start an I/O operation, and while it was waiting for the operation to complete, it would execute the processor-intensive program. The total execution time for the two jobs would be a little over one hour.

The next improvement was multiprogramming. In a multiprogramming system, multiple programs submitted by users were each allowed to use the processor for a short time. To users it appeared that all of the programs were executing at the same time. Problems of resource contention first arose in these systems. Explicit requests for resources led to the problem of deadlock. Competition for resources on machines with no tie-breaking instructions led to the critical section routine.

The next step in parallel processing was the introduction of multiprocessing. In these systems, two or more processors shared the work to be done. The earliest versions had a master/slave configuration. One processor (the master) was programmed to be responsible for all of the work in the system; the other (the slave) performed only those tasks it was assigned by the master. This arrangement was necessary because it was not then understood how to program the machines so they could cooperate in managing the resources of the system.

TYPES OF PARALLELISM

Levels of parallelism are decided based on the lumps of code that can be a potential candidate for parallelism.

Bit-level parallelism
From the advent of very-large-scale integration (VLSI) computer-chip fabrication technology in the 1970s until about 1986, speed-up in computer architecture was driven by doubling the computer word size, the amount of information the processor can manipulate per cycle. Increasing the word size reduces the number of instructions the processor must execute to perform an operation on variables whose sizes are greater than the length of the word. For example, where an 8-bit processor must add two 16-bit integers, the processor must first add the 8 lower-order bits from each integer using the standard addition instruction, then add the 8 higher-order bits using an add-with-carry instruction and the carry bit from the lower-order addition; thus, an 8-bit processor requires two instructions to complete a single operation, where a 16-bit processor would be able to complete the operation with a single instruction. Historically, 4-bit microprocessors were replaced with 8-bit, then 16-bit, then 32-bit microprocessors. This trend generally came to an end with the introduction of 32-bit processors, which were a standard in general-purpose computing for two decades. Only recently, with the advent of x86-64 architectures, have 64-bit processors become commonplace.
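For illustration, the following minimal Python sketch mimics the two-instruction sequence described above: a 16-bit addition carried out as an 8-bit add followed by an 8-bit add-with-carry. The helper name add16_on_8bit is ours, not taken from any particular instruction set.

def add16_on_8bit(a, b):
    # Split each 16-bit operand into low and high bytes.
    a_lo, a_hi = a & 0xFF, (a >> 8) & 0xFF
    b_lo, b_hi = b & 0xFF, (b >> 8) & 0xFF
    # Step 1: standard 8-bit ADD on the low bytes; note the carry out.
    lo_sum = a_lo + b_lo
    lo, carry = lo_sum & 0xFF, lo_sum >> 8
    # Step 2: 8-bit ADD-with-carry on the high bytes.
    hi = (a_hi + b_hi + carry) & 0xFF
    return (hi << 8) | lo

assert add16_on_8bit(0x12FF, 0x0001) == 0x1300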

Instruction-level parallelism
A computer program is, in essence, a stream of instructions executed by a processor. These instructions can be re-ordered and combined into groups which are then executed in parallel without changing the result of the program. This is known as instruction-level parallelism. Modern processors have multi-stage instruction pipelines. Each stage in the pipeline corresponds to a different action the processor performs on that instruction in that stage; a processor with an N-stage pipeline can have up to N different instructions at different stages of completion. The canonical example of a pipelined processor is a RISC processor, with five stages: instruction fetch, decode, execute, memory access, and write back. The Pentium 4 processor had a 35-stage pipeline.
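As a rough back-of-the-envelope sketch (ignoring stalls, hazards and branch effects), the benefit of pipelining can be expressed as a cycle count: an N-stage pipeline finishes n instructions in about N + n - 1 cycles instead of N * n. The function names below are ours, purely for illustration.

def pipelined_cycles(n_instructions, n_stages):
    # Idealized: the first instruction takes n_stages cycles to flow through,
    # then one instruction completes every cycle (no stalls or hazards).
    return n_stages + n_instructions - 1

def unpipelined_cycles(n_instructions, n_stages):
    # Without overlap, every instruction occupies all stages in turn.
    return n_stages * n_instructions

# Classic five-stage RISC pipeline: fetch, decode, execute, memory, write-back.
print(pipelined_cycles(100, 5))    # 104 cycles
print(unpipelined_cycles(100, 5))  # 500 cycles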

Data parallelism
Data parallelism is parallelism inherent in program loops, which focuses on distributing the data across different computing nodes to be processed in parallel. "Parallelizing loops often leads to similar (not necessarily identical) operation sequences or functions being performed on elements of a large data structure." Many scientific and engineering applications exhibit data parallelism. A loop-carried dependency is the dependence of a loop iteration on the output of one or more previous iterations. Loop-carried dependencies prevent the parallelization of loops.
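A minimal sketch of data parallelism using Python's multiprocessing module, assuming a hypothetical per-element function square: the same operation is applied to different pieces of the data by a pool of worker processes, while the loop-carried accumulation at the end shows the kind of loop that cannot be split up this way.

from multiprocessing import Pool

def square(x):
    # Independent per-element work: no iteration depends on another,
    # so the loop can be distributed across workers.
    return x * x

if __name__ == "__main__":
    data = list(range(16))
    with Pool(processes=4) as pool:
        results = pool.map(square, data)   # same operation, different data
    print(results)

    # Loop-carried dependency: each step needs the previous result,
    # so this loop cannot be parallelized the same way.
    acc = [0]
    for x in data:
        acc.append(acc[-1] + x)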

Task parallelism
Task parallelism is the characteristic of a parallel program that "entirely different calculations can be performed on either the same or different sets of data". This contrasts with data parallelism, where the same calculation is performed on the same or different sets of data. Task parallelism does not usually scale with the size of a problem.
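A minimal sketch of task parallelism with Python's multiprocessing module: two entirely different (hypothetical) calculations, compute_statistics and compute_extremes, run at the same time on the same data set.

from multiprocessing import Process

def compute_statistics(data):
    # One task: a summary calculation over the data set.
    print("mean:", sum(data) / len(data))

def compute_extremes(data):
    # A different task, run concurrently on the same data.
    print("min/max:", min(data), max(data))

if __name__ == "__main__":
    data = list(range(1, 101))
    tasks = [Process(target=compute_statistics, args=(data,)),
             Process(target=compute_extremes, args=(data,))]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()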

HARDWARE
The core elements of parallel processing are CPUs. Based on the number of instruction and data streams that can be processed simultaneously, computer systems are classified into the following four categories:
1) Single Instruction Single Data (SISD)
2) Single Instruction Multiple Data (SIMD)
3) Multiple Instruction Single Data (MISD)
4) Multiple Instruction Multiple Data (MIMD)

Single Instruction Single Data (SISD)


A SISD system is a uniprocessor machine capable of executing a single instruction operating on a single data stream. In SISD machines, instructions are processed sequentially, and hence computers adopting this model are popularly called sequential computers. Most conventional computers are built using the SISD model. All the instructions and data to be processed have to be stored in primary memory.

Examples: older-generation mainframes, minicomputers and workstations; most modern-day PCs.

Single Instruction Multiple Data (SIMD)


It is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data items simultaneously. Thus, such machines exploit data-level parallelism.

Examples: Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
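As a loose illustration of the SIMD idea, the sketch below uses NumPy, whose element-wise array arithmetic is commonly compiled down to vectorized (SIMD) instructions where the hardware supports them; conceptually, one add operation is applied to many data elements at once.

import numpy as np

a = np.arange(8, dtype=np.float32)
b = np.full(8, 2.0, dtype=np.float32)

# One logical operation applied to every element pair at once;
# on typical hardware the underlying loop is vectorized with SIMD instructions.
c = a + b
print(c)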

Multiple Instruction Single Data (MISD)


It is a type of parallel computing architecture where many functional units perform different operations on the same data. Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline.

Multiple Instruction Multiple Data (MIMD)

It is a technique employed to achieve parallelism. Machines using MIMD have a number of processors that function asynchronously and independently. At any time, different processors may be executing different instructions on different pieces of data. MIMD architectures may be used in a number of application areas such as CAD/CAM, simulation, modeling, and as communication switches.

SHARED MEMORY ORGANISATION


A shared memory model is one in which processors communicate by reading and writing locations in a shared memory that is equally accessible by all processors. Each processor may have registers, buffers, caches and local memory banks as additional memory resources. A number of basic issues in the design of shared memory systems have to be taken into consideration. These include access control, synchronization, protection and security. Access control determines which process accesses are possible to which resources. Access control models make the required check for every request issued by the processors to the shared memory, against the contents of the access control table. Synchronization constraints limit the time of accesses from sharing processes to shared resources. Appropriate synchronization ensures that information flows properly and that the system functions correctly.

Protection is a system feature that prevents processes from making arbitrary accesses to resources belonging to other processes. Sharing and protection are incompatible: sharing allows access, whereas protection restricts it.
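A minimal sketch of the shared memory model and of the synchronization issue discussed above, using Python's multiprocessing primitives: several processes update one shared counter, and a lock serializes the read-modify-write so the updates do not interleave. The names are ours, for illustration only.

from multiprocessing import Process, Value, Lock

def deposit(balance, lock, count):
    # The lock makes the read-modify-write atomic, so concurrent
    # processes cannot interleave in the middle of an update.
    for _ in range(count):
        with lock:
            balance.value += 1

if __name__ == "__main__":
    balance = Value('i', 0)   # an integer living in shared memory
    lock = Lock()
    workers = [Process(target=deposit, args=(balance, lock, 1000))
               for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(balance.value)      # 4000 with the lock; unpredictable without it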

MESSAGE PASSING ORGANISATION


Message passing systems are a class of multiprocessors in which each processor has access to its own local memory. Unlike shared memory systems, communication in message passing systems is performed via send and receive operations. A node in such a system consists of a processor and its local memory. Nodes are typically able to store messages in buffers and perform send/receive operations at the same time as processing. Simultaneous message processing and problem calculation are handled by the underlying operating system. Processors do not share a global memory, and each processor has access to its own address space. The processing units of a message passing system may be connected in a variety of ways, ranging from architecture-specific interconnection structures to geographically dispersed networks. The message passing approach is, in principle, scalable to large proportions. By scalable, it is meant that the number of processors can be increased without significant decrease in efficiency of operation. Message passing multiprocessors employ a variety of static networks for local communication.
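A minimal sketch of the message passing model using Python's multiprocessing Pipe: each process works only on its own local data, and results move between nodes exclusively through explicit send and receive operations. The worker function is hypothetical.

from multiprocessing import Process, Pipe

def worker(conn):
    # Each node owns its local data; the only way to share a result
    # is an explicit send over the communication channel.
    local_data = list(range(10))
    conn.send(sum(local_data))    # send operation
    conn.close()

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end,))
    p.start()
    print("received:", parent_end.recv())   # receive operation
    p.join()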

CLASSES OF PARALLEL COMPUTERS


Parallel computers can be roughly classified according to the level at which the hardware supports parallelism. This classification is broadly analogous to the distance between basic computing nodes. These are not mutually exclusive; for example, clusters of symmetric multiprocessors are relatively common.

MULTI-CORE COMPUTING
A multi-core processor is a processor that includes multiple execution units ("cores") on the same chip. These processors differ from superscalar processors, which can issue multiple instructions per cycle from one instruction stream (thread); in contrast, a multi-core processor can issue multiple instructions per cycle from multiple instruction streams. Each core in a multi-core processor can potentially be superscalar as well; that is, on every cycle, each core can issue multiple instructions from one instruction stream.

SYMMETRIC MULTIPROCESSORS
A symmetric multiprocessor (SMP) is a computer system with multiple identical processors that share memory and connect via a bus. Bus contention prevents bus architectures from scaling. As a result, SMPs generally do not comprise more than 32 processors. "Because of the small size of the processors and the significant reduction in the requirements for bus bandwidth achieved by large caches, such symmetric multiprocessors are extremely cost-effective, provided that a sufficient amount of memory bandwidth exists."

DISTRIBUTED COMPUTING
A distributed computer (also known as a distributed memory multiprocessor) is a distributed memory computer system in which the processing elements are connected by a network. Distributed computers are highly scalable.

CLUSTER COMPUTING
A cluster is a group of loosely coupled computers that work together closely, so that in some respects they can be regarded as a single computer. Clusters are composed of multiple standalone machines connected by a network. While machines in a cluster do not have to be symmetric, load balancing is more difficult if they are not. The most common type of cluster is the Beowulf cluster, which is a cluster implemented on multiple identical commercial off-the-shelf computers connected with a TCP/IP Ethernet local area network.

MASSIVELY PARALLEL PROCESSING


A massively parallel processor (MPP) is a single computer with many networked processors. MPPs have many of the same characteristics as clusters, but MPPs have specialized interconnect networks (whereas clusters use commodity hardware for networking). MPPs also tend to be larger than clusters, typically having "far more" than 100 processors. In an MPP, "each CPU contains its own memory and copy of the operating system and application. Each subsystem communicates with the others via a high-speed interconnect."

GRID COMPUTING
Grid computing is the most distributed form of parallel computing. It makes use of computers communicating over the Internet to work on a given problem. Because of the low bandwidth and extremely high latency available on the Internet, grid computing typically deals only with embarrassingly parallel problems. Many grid computing applications have been created, of which SETI@home and Folding@Home are the best-known examples. Most grid computing applications use middleware, software that sits between the operating system and the application to manage network resources and standardize the software interface. The most common grid computing middleware is the Berkeley Open Infrastructure for Network Computing (BOINC). Often, grid computing software makes use of "spare cycles", performing computations at times when a computer is idling.

INTERCONNECTION NETWORKS

Multiprocessor interconnection networks (INs) can be classified based on a number of criteria. These include the following.

MODE OF OPERATION

According to the mode of operation, INs are classified as synchronous versus asynchronous. In synchronous mode of operation, a single global clock is used by all components in the system such that the whole system is operating in a lockstep manner. Asynchronous mode of operation, on the other hand, does not require a global clock. Handshaking signals are used instead in order to coordinate the operation of asynchronous systems. While synchronous systems tend to be slower compared to asynchronous systems, they are race and hazard-free.

CONTROL STRATEGY
According to the control strategy, INs can be classified as centralized versus decentralized. In centralized control systems, a single central control unit is used to oversee and control the operation of the components of the system. In decentralized control, the control function is distributed among different components in the system.

SWITCHING TECHNIQUES
Interconnection networks can be classified according to the switching mechanism as circuit versus packet switching networks. In the circuit switching mechanism, a complete path has to be established prior to the start of communication between a source and a destination. The established path will remain in existence during the whole communication period. In a packet switching mechanism, communication between a source and destination takes place via messages that are divided into smaller entities, called packets. On their way to the destination, packets can be sent from one node to another in a store-and-forward manner until they reach their destination.
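A toy sketch of packet switching with store-and-forward delivery (the function names and route are ours, purely for illustration): the message is divided into packets, and each node on the route stores a packet in its buffer before forwarding it toward the destination.

def split_into_packets(message, size):
    # Divide the message into smaller entities (packets).
    return [message[i:i + size] for i in range(0, len(message), size)]

def store_and_forward(packets, route):
    # Each node along the route stores an incoming packet in its buffer
    # before forwarding it to the next node on the path.
    buffers = {node: [] for node in route}
    for packet in packets:
        for node in route:
            buffers[node].append(packet)            # store
            print(f"node {node} forwards {packet!r}")  # forward
    return buffers

route = ["A", "B", "C"]   # source -> intermediate node -> destination
store_and_forward(split_into_packets("HELLO WORLD", 4), route)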

TOPOLOGY
An interconnection network topology is a mapping function from the set of processors and memories onto the same set of processors and memories. In other words, the topology describes how to connect processors and memories to other processors and memories. A fully connected topology, for example, is a mapping in which each processor is connected to all other processors in the computer. A ring topology is a mapping that connects processor k to its neighbors, processors (k -1) and (k + 1).
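The two topology examples above can be written directly as mapping functions; the sketch below (with names of our choosing) returns the neighbors of processor k in a ring of n processors and in a fully connected network.

def ring_neighbors(k, n):
    # In a ring of n processors, processor k is connected to
    # processors (k - 1) and (k + 1), wrapping around at the ends.
    return ((k - 1) % n, (k + 1) % n)

def fully_connected_neighbors(k, n):
    # In a fully connected topology, k is linked to every other processor.
    return [i for i in range(n) if i != k]

print(ring_neighbors(0, 8))              # (7, 1)
print(fully_connected_neighbors(2, 4))   # [0, 1, 3]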

SPECIALIZED PARALLEL COMPUTERS


Within parallel computing, there are specialized parallel devices that remain niche areas of interest. While not domain-specific, they tend to be applicable to only a few classes of parallel problems.

Reconfigurable computing with field-programmable gate arrays


Reconfigurable computing is the use of a field-programmable gate array (FPGA) as a co-processor to a general-purpose computer. An FPGA is, in essence, a computer chip that can rewire itself for a given task. FPGAs can be programmed with hardware description languages such as VHDL or Verilog. However, programming in these languages can be tedious. Several vendors have created C to HDL languages that attempt to emulate the syntax and/or semantics of the C programming language, with which most programmers are familiar.

General-purpose computing on graphics processing units (GPGPU)

General-purpose computing on graphics processing units (GPGPU) is a fairly recent trend in computer engineering research. GPUs are co-processors that have been heavily optimized for computer graphics processing. Computer graphics processing is a field dominated by data-parallel operations, particularly linear algebra matrix operations. In the early days, GPGPU programs used the normal graphics APIs for executing programs. However, several new programming languages and platforms have been built to do general-purpose computation on GPUs, with both Nvidia and AMD releasing programming environments, CUDA and CTM respectively.

Vector processors

A vector processor is a CPU or computer system that can execute the same instruction on large sets of data. "Vector processors have high-level operations that work on linear arrays of numbers or vectors. An example vector operation is A = B + C, where A, B, and C are each 64-element vectors of 64-bit floating-point numbers." They are closely related to Flynn's SIMD classification.
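A small sketch of the quoted example, assuming NumPy as a stand-in for the hardware's vector unit: the scalar version needs a 64-iteration loop of element adds, while the vector version expresses A = B + C as a single high-level operation over linear arrays.

import numpy as np

B = np.random.rand(64)   # 64-element vectors of 64-bit floating-point numbers
C = np.random.rand(64)

# Scalar processor: one add per element, i.e. a 64-iteration loop.
A_scalar = [B[i] + C[i] for i in range(64)]

# Vector processor: conceptually one vector instruction A = B + C;
# NumPy's array addition plays that role here.
A_vector = B + C
assert all(A_vector[i] == A_scalar[i] for i in range(64))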

APPLICATIONS OF PARALLEL PROCESSING

This decomposing technique is used in applications requiring the processing of large amounts of data in sophisticated ways. For example:
1. Databases and data mining.
2. Networked video and multimedia technologies.
3. Medical imaging and diagnosis.
4. Advanced graphics and virtual reality.
5. Collaborative work environments.
6. Frameworks - Dataflow frameworks provide the highest performance and simplest method for expressing record-processing applications so that they are able to achieve high scalability and total throughput.

7. RDBMS - As the most common repositories for commercial record-oriented data, RDBMS systems have evolved so that the Structured Query Language (SQL) used to access them is executed in parallel. The nature of the SQL language lends itself to faster processing using parallel techniques.
8. Parallelizing compilers - For technical and mathematical applications dominated by matrix algebra, there are compilers that can create parallel execution from seemingly sequential program source code. These compilers can decompose a program and insert the necessary message passing structures and other parallel constructs automatically.

FUTURE OF PARALLEL PROCESSING

Parallel processing is expected to lead to other major changes in the industry. Major companies like Intel Corp. and Advanced Micro Devices Inc. have already integrated four processors in a single chip. With breakthroughs in the supporting technologies still needed, the race for results in parallel computing is in full swing. Another great challenge is to write software that divides a program's work into chunks that can be spread across processors; this may require new programming languages and could revolutionize every piece of software written. Parallel computing may change the way computers work in the future, and how we use them for work and play.

REFERENCES

COMPUTER ARCHITECTURE: PIPELINED AND PARALLEL PROCESSOR DESIGN
Written by: Michael J. Flynn. Publication: Jones and Bartlett Publishers International.

COMPUTER ARCHITECTURE AND PARALLEL PROCESSING
Written by: Bharat Bhushan Agrawal and Sumit Prakash Tayal. Publication: University Science Press.

ADVANCED COMPUTER ARCHITECTURE AND PARALLEL PROCESSING
Written by: Hesham El-Rewini and Mostafa Abd-El-Barr.

www.cse.iitd.ernet.in/.../Lect01.LecJan09_2006.Introduction.ppt
www.intel.com/ParallelProgramming
en.wikipedia.org/wiki/Parallel_computing
http://www.wifinotes.com/computer-networks/what-is-parallel-computing.html
