This action might not be possible to undo. Are you sure you want to continue?
Traditionally, software has been written for serial computation: o To be run on a single computer having a single Central Processing Unit (CPU); o A problem is broken into a discrete series of instructions. o Instructions are executed one after another. o Only one instruction may execute at any moment in time.
In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem: o To be run using multiple CPUs o A problem is broken into discrete parts that can be solved concurrently o Each part is further broken down to a series of instructions o Instructions from each part execute simultaneously on different CPUs
The compute resources can include: o A single computer with multiple processors; o An arbitrary number of computers connected by a network; o A combination of both. The computational problem usually demonstrates characteristics such as the ability to be: o Broken apart into discrete pieces of work that can be solved simultaneously; o Execute multiple program instructions at any moment in time; o Solved in less time with multiple compute resources than with a single compute resource.
Uses for Parallel Computing:
Historically, parallel computing has been considered to be "the high end of computing", and has been used to model difficult scientific and engineering problems found in the real world. Some examples: o Atmosphere, Earth, Environment o Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics o Bioscience, Biotechnology, Genetics o Chemistry, Molecular Sciences o Geology, Seismology o Mechanical Engineering - from prosthetics to spacecraft o Electrical Engineering, Circuit Design, Microelectronics
Computer Science, Mathematics
Today, commercial applications provide an equal or greater driving force in the development of faster computers. These applications require the processing of large amounts of data in sophisticated ways. For example: o Databases, data mining o Oil exploration o Web search engines, web based business services o Medical imaging and diagnosis o Pharmaceutical design o Management of national and multi-national corporations o Financial and economic modeling o Advanced graphics and virtual reality, particularly in the entertainment industry o Networked video and multi-media technologies o Collaborative work environments
Why Use Parallel Computing?
Save time and/or money: In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings. Parallel clusters can be built from cheap, commodity components. Solve larger problems: Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer, especially given limited computer memory. For example: o "Grand Challenge" (en.wikipedia.org/wiki/Grand_Challenge) problems requiring PetaFLOPS and PetaBytes of computing resources. o Web search engines/databases processing millions of transactions per second
Provide concurrency: A single compute resource can only do one thing at a time. Multiple computing resources can be doing many things simultaneously. For example, the Access Grid (www.accessgrid.org) provides a global collaboration network where people from around the world can meet and conduct work "virtually".
Use of non-local resources: Using compute resources on a wide area network, or even the Internet when local compute resources are scarce. For example: o SETI@home (setiathome.berkeley.edu) uses over 330,000 computers for a compute power over 528 TeraFLOPS (as of August 04, 2008) o Folding@home (folding.stanford.edu) uses over 340,000 computers for a compute power of 4.2 PetaFLOPS (as of November 4, 2008)
Limits to serial computing: Both physical and practical reasons pose significant constraints to simply building ever faster serial computers: o Transmission speeds - the speed of a serial computer is directly dependent upon how fast data can move through hardware. Absolute limits are the speed of light (30 cm/nanosecond) and the transmission limit of copper wire (9 cm/nanosecond). Increasing speeds necessitate increasing proximity of processing elements. o Limits to miniaturization - processor technology is allowing an increasing number of transistors to be placed on a chip. However, even with molecular or atomic-level components, a limit will be reached on how small components can be. o Economic limitations - it is increasingly expensive to make a single processor faster. Using a larger number of moderately fast commodity processors to achieve the same (or better) performance is less expensive. Current computer architectures are increasingly relying upon hardware level parallelism to improve performance:
o o o o
Multiple execution units Pipelined instructions Multi-core
Concepts and Terminology
von Neumann Architecture
Named after the Hungarian mathematician John von Neumann who first authored the general requirements for an electronic computer in his 1945 papers. Since then, virtually all computers have followed this basic design, which differed from earlier computers programmed through "hard wiring". o Comprised of four main components: Memory Control Unit
Arithmetic Logic Unit Input/Output Read/write, random access memory is used to store both program instructions and data Program instructions are coded data which tell the computer to do something Data is simply information to be used by the program Control unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task. Aritmetic Unit performs basic arithmetic operations
Input/Output is the interface to the human operator
Flynn's Classical Taxonomy
There are different ways to classify parallel computers. One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy. Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of Instruction and Data. Each of these dimensions can have only one of two possible states: Single or Multiple. The matrix below defines the 4 possible classifications according to Flynn:
Single Instruction, Single Data
Single Instruction, Multiple Data
Multiple Instruction, Single Data Multiple Instruction, Multiple Data
Single Instruction, Single Data (SISD):
• • •
A serial (non-parallel) computer Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle Single data: only one data stream is being used as input during any one clock cycle Deterministic execution This is the oldest and even today, the most common type of computer Examples: older generation mainframes, minicomputers and workstations; most modern day PCs.
Single Instruction, Multiple Data (SIMD):
• • • • • •
A type of parallel computer Single instruction: All processing units execute the same instruction at any given clock cycle Multiple data: Each processing unit can operate on a different data element Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing. Synchronous (lockstep) and deterministic execution Two varieties: Processor Arrays and Vector Pipelines
Multiple Instruction, Single Data (MISD):
• • • •
A single data stream is fed into multiple processing units. Each processing unit operates on the data independently via independent instruction streams. Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971). Some conceivable uses might be: o multiple frequency filters operating on a single signal stream o multiple cryptography algorithms attempting to crack a single coded message.
Multiple Instruction, Multiple Data (MIMD):
• • • • •
Currently, the most common type of parallel computer. Most modern computers fall into this category. Multiple Instruction: every processor may be executing a different instruction stream Multiple Data: every processor may be working with a different data stream Execution can be synchronous or asynchronous, deterministic or nondeterministic Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs. Note: many MIMD architectures also include SIMD execution sub-components
Handler classificqation :
In computer programming, an event handler is an asynchronous callback subroutine that handles inputs received in a program. Each event is a piece of application-level information from the underlying framework, typically the GUI toolkit. GUI events include key presses, mouse movement, action selections, and timers expiring. On a lower level, events can represent availability of new data for reading a file or network stream. Event handlers are a central concept in event-driven programming. The events are created by the framework based on interpreting lower-level inputs, which may be lower-level events themselves. For example, mouse movements and clicks are interpreted as menu selections. The events initially originate from actions on the operating system level, such as interrupts generated by hardware devices, software interrupt instructions, or state changes in polling. On this level, interrupt handlers and signal handlers correspond to event handlers.
Created events are first processed by an event dispatcher within the framework. It typically manages the associations between events and event handlers, and may queue event handlers or events for later processing. Event dispatchers may call event handlers directly, or wait for events to be dequeued with information about the handler to be executed. Handling signals Signal handlers can be installed with the signal() system call. If a signal handler is not installed for a particular signal, the default handler is used. Otherwise the signal is intercepted and the signal handler is invoked. The process can also specify two default behaviors, without creating a handler: ignore the signal (SIG_IGN) and use the default signal handler (SIG_DFL). There are two signals which cannot be intercepted and handled: SIGKILL and SIGSTOP.
Signal handling is vulnerable to race conditions. Because signals are asynchronous, another signal (even of the same type) can be delivered to the process during execution of the signal handling routine. The sigprocmask() call can be used to block and unblock delivery of signals. Signals can cause the interruption of a system call in progress, leaving it to the application to manage a non-transparent restart. Signal handlers should be written in a way that doesn't result in any unwanted sideeffects, e.g. errno alteration, signal mask alteration, signal disposition change, and other global process attribute changes. Use of non-reentrant functions, e.g. malloc or printf, inside signal handlers is also unsafe.
 Relationship with Hardware Exceptions
A process's execution may result in the generation of a hardware exception, for instance, if the process attempts to divide by zero or incurs a TLB miss. In Unix-like operating systems, this event automatically changes the processor context to start executing a kernel exception handler. With some exceptions, such as a page fault, the kernel has sufficient information to fully handle the event and resume the process's execution. In other exceptions, however, the kernel cannot proceed intelligently and must instead defer the exception handling operation to the faulting process. This deferral is achieved via the signal mechanism, wherein the kernel sends to the process a signal corresponding to the current exception. For example, if a process attempted to divide by zero on an x86 CPU, a divide error exception would be generated and cause the kernel to send the SIGFPE signal to the process. Similarly, if the process attempted to access a memory address outside of its virtual address space, the kernel would notify the process of this violation via a SIGSEGV signal. The exact mapping between signal names and exceptions is obviously dependent upon the CPU, since exception types differ between architectures.
Amdahl's law and Gustafson's law
A graphical representation of Amdahl's law. The speed-up of a program from parallelization is limited by how much of the program can be parallelized. For example, if 90% of the program can be parallelized, the theoretical maximum speed-up using parallel computing would be 10x no matter how many processors are used. Optimally, the speed-up from parallelization would be linear—doubling the number of processing elements should halve the runtime, and doubling it a second time should again halve the runtime. However, very few parallel algorithms achieve optimal speed-up. Most of them have a near-linear speed-up for small numbers of processing elements, which flattens out into a constant value for large numbers of processing elements. Grid computing is the most distributed form of parallel computing. It makes use of computers communicating over the Internet to work on a given problem. Because of the low bandwidth and extremely high latency available on the Internet, grid computing typically deals only with embarrassingly parallel problems. Many grid computing applications have been created, of which SETI@home and Folding@Home are the best-known examples. Most grid computing applications use middleware, software that sits between the operating system and the application to manage network resources and standardize the software interface. The most common grid computing middleware is the Berkeley Open Infrastructure for Network Computing (BOINC). Often, grid computing software makes use of "spare cycles", performing computations at times when a computer is idling.
The potential speed-up of an algorithm on a parallel computing platform is given by Amdahl's law, originally formulated by Gene Amdahl in the 1960s. It states that a small portion of the program which cannot be parallelized will limit the overall speed-up available from parallelization. Any large mathematical or engineering problem will typically consist of several parallelizable parts and several non-parallelizable (sequential) parts. This relationship is given by the equation: S= 1/1-p where S is the speed-up of the program (as a factor of its original sequential runtime), and P is the fraction that is parallelizable. If the sequential portion of a program is 10% of the runtime, we can get no more than a 10× speed-up, regardless of how many processors are
added. This puts an upper limit on the usefulness of adding more parallel execution units. "When a task cannot be partitioned because of sequential constraints, the application of more effort has no effect on the schedule. The bearing of a child takes nine months, no matter how many women are assigned." Gustafson's law is another law in computer engineering, closely related to Amdahl's law. It can be formulated as: S(p)=p- aplha(p-1)
Assume that a task has two independent parts, A and B. B takes roughly 25% of the time of the whole computation. With effort, a programmer may be able to make this part five times faster, but this only reduces the time for the whole computation by a little. In contrast, one may need to perform less work to make part A twice as fast. This will make the computation much faster than by optimizing part B, even though B got a greater speed-up (5× versus 2×). where P is the number of processors, S is the speed-up, and α the non-parallelizable part of the process. Amdahl's law assumes a fixed-problem size and that the size of the sequential section is independent of the number of processors, whereas Gustafson's law does not make these assumptions.
This action might not be possible to undo. Are you sure you want to continue?
We've moved you to where you read on your other device.
Get the full title to continue reading from where you left off, or restart the preview.