The second generation (1952-1963) Transistors were invented in 1948. The first transistorized digital computer, TRADIC, was
built by Bell Laboratories in 1954. Discrete transistors and diodes were the building blocks: 800 transistors were
used in TRADIC. Printed circuits appeared.
The third generation (1962-1975) This generation was marked by the use of small-scale integrated (SSI) and medium-scale
integrated (MSI) circuits as the basic building blocks. Multilayered printed circuits were used. Core memory was
still used in CDC-6600 and other machines but, by 1968, many fast computers, like CDC-7600, began to replace cores with solid-
state memories.
The fourth generation (1972-present) The present generation computers emphasize the use of large-scale integrated (LSI) circuits
for both logic and memory sections. High-density packaging has appeared. High-level languages are being extended to handle both
scalar and vector data, like the extended Fortran in many vector processors.
The future Computers to be used in the 1990s may be the next generation. Very large-scale integrated (VLSI) chips will be used
along with high-density modular design. Multiprocessors like the 16 processors in the S-1 project at Lawrence Livermore National
Laboratory and in the Denelcor's HEP will be required. Cray-2 is expected to have four processors, to be delivered in 1985. More
than 1000 mega floating point operations per second (megaflops) are expected in these future supercomputers.
Data processing
Information processing
Knowledge processing
Intelligence processing
By GS kosta
We are in an era which is promoting the use of computers not only for conventional data-information processing, but also toward the
building of workable machine knowledge-intelligence systems to advance human civilization. Many computer scientists feel that the
degree of parallelism exploitable at the two highest processing levels should be higher than that at the data-information processing
levels.
From an operating system point of view, computer systems have improved chronologically in four phases:
Batch processing
Multiprogramming
Time sharing
Multiprocessing
In these four operating modes, the degree of parallelism increases sharply from phase to phase. The general trend is to emphasize
parallel processing of information. In what follows, the term information is used with an extended meaning to include
data, information, knowledge, and intelligence. We formally define parallel processing as follows:
Parallel processing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the
computing process. Concurrency implies parallelism, simultaneity, and pipelining. Parallel events may occur in multiple resources
during the same time interval; simultaneous events may occur at the same time instant; and pipelined events may occur in
overlapped time spans.
As far as parallel processing is concerned, the general architectural trend is being shifted away from conventional uniprocessor
systems to multiprocessor systems or to an array of processing elements controlled by one uniprocessor. In all cases, a high degree
of pipelining is being incorporated into the various system levels.
The architectures of two commercially available uniprocessor computers are given below to show the possible
interconnection of structures among the three subsystems. We will examine major components in the CPU and in
the I/O subsystem.
Multiplicity of functional units The early computer had only one arithmetic and logic unit in its CPU. Furthermore, the ALU could
only perform one function at a time, a rather slow process for executing a long sequence of arithmetic logic instructions. In practice,
many of the functions of the ALU can be distributed to multiple and specialized functional units which can operate in parallel. The
CDC-6600 (designed in 1964) has 10 functional units built into its CPU (Figure 1.5). These 10 units are independent of each other
and may operate simultaneously. A scoreboard is used to keep track of the availability of the functional units and
registers being demanded. With 10 functional units and 24 registers available, the
instruction issue rate can be significantly increased.
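The scoreboard idea can be sketched in a few lines. The toy issue check below (names and structure are my own, not the actual CDC-6600 scoreboard logic) issues an instruction only when a unit of the required type is free, and stalls it otherwise:

```python
def issue(instructions, units):
    """Toy scoreboard: issue each instruction to its required
    functional-unit type only if a unit of that type is free;
    otherwise the instruction stalls in this cycle."""
    free = dict(units)            # unit type -> number available
    issued, stalled = [], []
    for name, unit in instructions:
        if free.get(unit, 0) > 0:
            free[unit] -= 1       # mark one unit of this type busy
            issued.append(name)
        else:
            stalled.append(name)
    return issued, stalled

# i1 and i3 can issue together; i2 stalls because the single
# adder is already busy with i1.
issued, stalled = issue([("i1", "add"), ("i2", "add"), ("i3", "mul")],
                        {"add": 1, "mul": 2})
```

With more functional units (and more registers to hold operands), more instructions can issue per cycle, which is exactly the effect the text describes.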
Another good example of a multifunction uniprocessor is the IBM 360/91 (1968), which has two parallel execution units.
Overlapped CPU and I/O operations I/O operations can be performed simultaneously with the CPU computations by using
separate I/O controllers, channels, or I/O processors. The direct-memory-access (DMA) channel can be used to provide direct
information transfer between the I/O devices and the main memory. The DMA is conducted on a cycle-stealing basis, which is
transparent to the CPU.
SISD computer organization This organization, shown in Figure 1.16a, represents most serial computers available today.
Instructions are executed sequentially but may be overlapped in their execution stages (pipelining). Most SISD uniprocessor
systems are pipelined. An SISD computer may have more than one functional unit in it. All the functional units are under the
supervision of one control unit.
SIMD computer organization This class corresponds to array processors, introduced in Section 1.3.2. As illustrated in Figure
1.16b, there are multiple processing elements supervised by the same control unit. All PEs receive the same instruction broadcast
from the control unit but operate on different data sets from distinct data streams. The shared memory subsystem may contain
multiple modules.
MISD computer organization This organization is conceptually illustrated in Figure 1.16c. There are n processor units, each
receiving distinct instructions operating over the same data stream and its derivatives. The results (output) of one processor become
the input (operands) of the next processor in the micropipe. This structure has received much less attention and has been challenged
as impractical by some computer architects. No real embodiment of this class exists.
MIMD computer organization Most multiprocessor systems and multiple computer systems can be classified in this category
(Figure 1.16d). An intrinsic MIMD computer implies
interactions among the n processors because all n data streams are derived from the same data space shared by all processors. If
the n data streams were derived from disjointed subspaces of the shared memories, then we would have the so-called multiple SISD
(MSISD) operation, which is nothing but a set of n independent SISD uniprocessor systems.
Feng's classification uses the degree of parallelism to classify various computer architectures.
There are four types of processing methods that can be seen from this diagram:
Word-serial and bit-serial (WSBS)
Word-parallel and bit-serial (WPBS)
Word-serial and bit-parallel (WSBP)
Word-parallel and bit-parallel (WPBP)
WSBS has been called bit-serial processing because one bit (n = m = 1) is processed at a time, a rather slow process. This was
done only in the first-generation computers. WPBS (n = 1, m > 1) has been called bit-slice processing because an m-bit slice is
processed at a time. WSBP (n > 1, m = 1), as found in most existing computers, has been called word-slice processing because one
word of n bits is processed at a time. Finally, WPBP (n > 1, m > 1) is known as fully parallel processing (or simply parallel
processing, if no confusion exists), in which an array of n x m bits is processed at one time, the fastest processing mode of the four.
In Table 1.4, we have listed a number of computer systems under each processing mode. The system parameters n, m are also shown
for each system. The bit-slice processors, like STARAN, MPP, and DAP, all have long bit slices. Illiac-IV and PEPE are two word-
slice array processors.
The functions of PCU and ALU should be clear to us. Each PCU corresponds to one processor or one CPU. The ALU is equivalent
to the processing element (PE) we specified for SIMD array processors. The BLC corresponds to the combinational logic circuitry
needed to perform 1-bit operations in the ALU. A computer system C can be characterized by a triple containing six independent
entities, as defined below:
T(C) = < K x K', D x D', W x W'> (1.13)
where K = the number of processors (PCUs) within the computer
D = the number of ALUs (or PEs) under the control of one PCU
W = the word length of an ALU or of a PE
W' = the number of pipeline stages in all ALUs or in a PE
D' = the number of ALUs that can be pipelined (pipeline chaining to be
described in Chapter 4)
K' = the number of PCUs that can be pipelined
Several real computer examples are used to clarify the above parametric descriptions. The Texas Instruments Advanced Scientific
Computer (TI-ASC) has one controller controlling four arithmetic pipelines, each with a 64-bit word length and eight stages. Thus, we
have
T(ASC) = <1 x 1, 4 x 1, 64 x 8> = <1, 4, 64 x 8>
Amdahl's law
In computer architecture, Amdahl's law (or Amdahl's argument[1]) is a formula which gives the theoretical speedup in latency of
the execution of a task at fixed workload that can be expected of a system whose resources are improved. It is named after
computer scientist Gene Amdahl, and was presented at the AFIPS Spring Joint Computer Conference in 1967.
Amdahl's law is often used in parallel computing to predict the theoretical speedup when using multiple processors.
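As a sketch, the law can be written as a one-line function, where p is the parallelizable fraction of the workload and n is the number of processors (the function name is my own):

```python
def amdahl_speedup(p, n):
    """Theoretical speedup of a fixed workload whose
    parallelizable fraction p runs on n processors
    (Amdahl's law): 1 / ((1 - p) + p/n)."""
    return 1.0 / ((1.0 - p) + p / n)

# With 95% of the work parallelizable, even an unlimited
# number of processors cannot push the speedup past
# 1 / (1 - 0.95) = 20.
```

The serial fraction (1 - p) thus places a hard ceiling on the achievable speedup, no matter how many processors are added.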
Moore's Law
The quest for higher-performance digital computers seems unending. In
the past two decades, the performance of microprocessors has enjoyed an
exponential growth. The growth of microprocessor speed/performance by
a factor of 2 every 18 months (or about 60% per
year) is known as Moore's law. This growth is the result of a combination
of two factors:
of two factors:
1. Increase in complexity (related both to higher device density and to
larger size) of VLSI chips, projected to rise to around 10 M transistors per
chip for microprocessors, and 1B for dynamic random-access memories
(DRAMs), by the year 2000 [SIA94].
2. Introduction of, and improvements in, architectural features such as on-
chip cache memories, large instruction buffers, multiple instruction issue
per cycle, multithreading, deep pipelines, out-of-order instruction
execution, and branch prediction
Moore's law was originally formulated in 1965 in terms of the doubling
of chip complexity every year (later revised to every 18 months) based
only on a small number of data points [Scha97]. Moore's revised
prediction matches almost perfectly the actual increases in the
number of transistors in DRAM and microprocessor chips.
Moore's law seems to hold regardless of how one measures processor performance: counting the number of executed instructions
per second (IPS), counting the number of floating-point operations per second (FLOPS), or using sophisticated benchmark suites
that attempt to measure the processor's performance on real applications. This is because all of these measures, though
numerically different, tend to rise at roughly the same rate. Figure 1.1 shows that the performance of actual processors has in fact
followed Moore's law quite closely since 1980 and is on the verge of reaching the GIPS (giga IPS = 10^9 IPS) milestone.
[Figure 1.1. The exponential growth of microprocessor performance, known as Moore's law, shown over the past two decades.]
A basic block is a sequence or block of instructions with one entry and one exit.
Basic blocks are frequently used as the focus of optimizers in compilers (since it is easier to manage the use of
registers utilized in the block).
Limiting optimization to basic blocks limits the instruction level parallelism that can be obtained (to about 2 to 5 in
typical code).
1.1.7. Asymptotic Speedup 1
1.1.8. Asymptotic Speedup 2
1.2. Mean Performance
We seek to obtain a measure that characterizes the mean, or average, performance of a set of
benchmark programs with potentially many different execution modes (e.g. scalar, vector, sequential, parallel).
We may also wish to associate weights with these programs to emphasize these different modes and yield a more
meaningful performance measure.
1.2.1. Arithmetic Mean
The arithmetic mean is familiar (sum of the terms divided by the number of terms).
Our measures will use execution rates expressed in MIPS or Mflops.
The arithmetic mean of a set of execution rates is proportional to the sum of the inverses of the execution times; it
is not inversely proportional to the sum of the execution times.
Thus arithmetic mean fails to represent real times consumed by the benchmarks when executed.
The implication is that the best speedup possible is 1/a, where a is the fraction of the workload that must execute sequentially,
regardless of n, the number of processors.
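The shortcoming of the arithmetic mean can be seen numerically. In the sketch below (the example rates are my own), two equal-sized benchmarks run at 100 MIPS and 1 MIPS; the arithmetic mean of the rates wildly overstates the true aggregate rate, while the harmonic mean matches it exactly:

```python
def arithmetic_mean(rates):
    return sum(rates) / len(rates)

def harmonic_mean(rates):
    # Reciprocal of the mean time per unit of work: this is
    # the mean that reflects real elapsed execution time.
    return len(rates) / sum(1.0 / r for r in rates)

rates = [100.0, 1.0]      # MIPS of two equal-sized benchmarks
work = 100.0              # millions of instructions in each one

# Real elapsed time: 1 s + 100 s = 101 s for 200 M instructions,
# so the true aggregate rate is about 1.98 MIPS, nowhere near
# the arithmetic mean of 50.5 MIPS.
total_time = sum(work / r for r in rates)
true_rate = 2 * work / total_time
```

The harmonic mean weights each benchmark by the time it actually consumes, which is why it (and not the arithmetic mean) represents the real throughput of the benchmark set.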
1.3. Efficiency, Utilizations, and Quality
1.3.1. System Efficiency 1
Assume the following definitions:
O (n) = total number of unit operations performed by an n processor system in completing a program P.
T (n) = execution time required to execute the program P on an n processor system.
O (n) can be considered similar to the total number of instructions executed by the n processors, perhaps scaled by
a constant factor.
If we define O (1) = T (1), then it is logical to expect that T (n) < O (n) when n > 1 if the program P is able to make
any use at all of the extra processor(s).
1.3.2. System Efficiency 2
Clearly, the speedup factor (how much faster the program runs with n processors) can now be expressed as
S (n) = T (1) / T (n)
Recall that we expect T (n) < T (1), so S (n) > 1.
System efficiency is defined as
E (n) = S (n) / n = T (1) / ( n T (n) )
It indicates the actual degree of speedup achieved in a system as compared with the maximum possible speedup.
Thus 1/n ≤ E (n) ≤ 1. The value is 1/n when only one
processor is used (regardless of n), and the value is 1 when all processors are fully utilized.
1.3.3. Redundancy
The redundancy in a parallel computation is defined as
R (n) = O (n) / O (1)
What values can R (n) obtain?
R (n) = 1 when O (n) = O (1), or when the number of operations performed is independent of the number of
processors, n. This is the ideal case.
R (n) = n when each processor performs the same number of operations as when only a single processor is used; this
implies that n completely redundant computations are performed!
The R (n) figure indicates to what extent the software parallelism is carried over to the hardware implementation
without having extra operations performed.
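The three measures defined in this section follow directly from their formulas; the sketch below (the numeric example is my own) computes speedup, efficiency, and redundancy for a hypothetical 4-processor run:

```python
def speedup(t1, tn):
    """S(n) = T(1) / T(n)."""
    return t1 / tn

def efficiency(n, t1, tn):
    """E(n) = S(n) / n = T(1) / (n * T(n))."""
    return speedup(t1, tn) / n

def redundancy(o_n, o_1):
    """R(n) = O(n) / O(1)."""
    return o_n / o_1

# A program takes T(1) = 100 s on one processor and T(4) = 30 s
# on four processors, performing O(4) = 120 unit operations
# versus O(1) = 100 on a single processor.
s = speedup(100.0, 30.0)        # ~3.33
e = efficiency(4, 100.0, 30.0)  # ~0.83: processors 83% utilized
r = redundancy(120.0, 100.0)    # 1.2: 20% extra operations
```

Here R(n) > 1 reflects the extra coordination work the parallel version performs, and E(n) < 1 shows the gap between the achieved speedup and the ideal speedup of 4.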
The ideal speedup is S (n) = n, attained when all n processors are fully utilized (E (n) = 1).
Parallel algorithm
In computer science, a parallel algorithm, as opposed to a traditional serial algorithm, is an algorithm which can
be executed a piece at a time on many different processing devices, and then combined together again at the end
to get the correct result.[1]
Many parallel algorithms are executed concurrently, though in general concurrent algorithms are a distinct
concept, and thus the two concepts are often conflated, with which aspect of an algorithm is parallel and which is
concurrent not being clearly distinguished. Further, non-parallel, non-concurrent algorithms are often referred to
as "sequential algorithms", by contrast with concurrent algorithms.
A parallel algorithm is analyzed in terms of its:
Time complexity,
Total number of processors used, and
Total cost.
Time Complexity
The main reason behind developing parallel algorithms was to reduce the computation time of an algorithm. Thus, evaluating the
execution time of an algorithm is extremely important in analyzing its efficiency.
Execution time is measured on the basis of the time taken by the algorithm to solve a problem. The total execution time is calculated
from the moment when the algorithm starts executing to the moment it stops. If all the processors do not start or end execution at the
same time, then the total execution time of the algorithm is measured from the moment the first processor starts its execution to
the moment the last processor stops its execution.
Time complexity of an algorithm can be classified into three categories:
Worst-case complexity When the amount of time required by an algorithm for a given input is maximum.
Average-case complexity When the amount of time required by an algorithm for a given input is average.
Best-case complexity When the amount of time required by an algorithm for a given input is minimum.
Asymptotic Analysis
The complexity or efficiency of an algorithm is the number of steps executed by the algorithm to get the desired output. Asymptotic
analysis is done to calculate the complexity of an algorithm in its theoretical analysis. In asymptotic analysis, a large length of input
is used to calculate the complexity function of the algorithm.
Note: Asymptotic is a condition where a line tends to meet a curve, but they do not intersect. Here the line and the curve are
asymptotic to each other.
Asymptotic notation is the easiest way to describe the fastest and slowest possible execution time for an algorithm using upper
bounds and lower bounds on speed. For this, we use the following notations:
Big O notation
Omega notation
Theta notation
Big O notation
In mathematics, Big O notation is used to represent the asymptotic characteristics of functions. It represents the behavior of a function
for large inputs in a simple and accurate way. It is a method of representing the upper bound of an algorithm's execution time. It
represents the longest amount of time that the algorithm could take to complete its execution. The function
f(n) = O(g(n))
iff there exist positive constants c and n0 such that f(n) ≤ c * g(n) for all n where n ≥ n0.
Omega notation
Omega notation is a method of representing the lower bound of an algorithm's execution time. The function
f(n) = Ω(g(n))
iff there exist positive constants c and n0 such that f(n) ≥ c * g(n) for all n where n ≥ n0.
Theta Notation
Theta notation is a method of representing both the lower bound and the upper bound of an algorithm's execution time. The function
f(n) = Θ(g(n))
iff there exist positive constants c1, c2, and n0 such that c1 * g(n) ≤ f(n) ≤ c2 * g(n) for all n where n ≥ n0.
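The Big O condition can be spot-checked numerically. The helper below performs a finite check over a range of n, not a proof; the names and the example constants are my own:

```python
def is_upper_bounded(f, g, c, n0, n_max=1000):
    """Finite spot-check of the Big O condition:
    f(n) <= c * g(n) for all n0 <= n <= n_max."""
    return all(f(n) <= c * g(n) for n in range(n0, n_max + 1))

# f(n) = 3n + 5 is O(n): the constants c = 4, n0 = 5 witness it,
# since 3n + 5 <= 4n exactly when n >= 5.
linear_ok = is_upper_bounded(lambda n: 3 * n + 5, lambda n: n, 4, 5)

# n^2 is NOT O(n): whatever constant we try, the bound
# eventually fails (here c = 100 fails once n exceeds 100).
quadratic_ok = is_upper_bounded(lambda n: n * n, lambda n: n, 100, 1)
```

A check like this can falsify a claimed bound but never prove one, since the definition quantifies over all n beyond n0.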
Speedup of an Algorithm
The performance of a parallel algorithm is determined by calculating its speedup. Speedup is defined as the ratio of the worst-case
execution time of the fastest known sequential algorithm for a particular problem to the worst-case execution time of the parallel
algorithm.
speedup = Worst-case execution time of the fastest known sequential algorithm for a particular problem / Worst-case execution
time of the parallel algorithm
Number of Processors Used
The number of processors used is an important factor in analyzing the efficiency of a parallel algorithm. The cost to buy, maintain,
and run the computers is calculated. The larger the number of processors used by an algorithm to solve a problem, the more costly
the obtained result becomes.
Total Cost
Total cost of a parallel algorithm is the product of time complexity and the number of processors used in that particular algorithm.
Total Cost = Time complexity × Number of processors used
Therefore, the efficiency of a parallel algorithm is
Efficiency = Worst-case execution time of the sequential algorithm / (Number of processors × Worst-case execution time of the
parallel algorithm)
Within the MIMD class, three fundamental issues or design choices are subjects of ongoing debates in the research community:
1. MPP: massively or moderately parallel processor. Is it more cost-effective to build a parallel processor out of a relatively small
number of powerful processors or a massive number of very simple processors (the "herd of elephants" or the "army of ants"
approach)? Referring to Amdahl's law, the first choice does better on the inherently sequential part of a computation while the
second approach might allow a higher speed-up for the parallelizable part. A general answer cannot be given to this question, as
the best choice is both application- and technology-dependent.
2. Tightly versus loosely coupled MIMD. Which is a better approach to high-performance computing, that of using specially
designed multiprocessors/multicomputers or a collection of ordinary workstations that are interconnected by commodity networks
(such as Ethernet or ATM) and whose interactions are coordinated by special system software and distributed file systems? The
latter choice, sometimes referred to as network of workstations (NOW) or cluster computing, has been gaining popularity in
recent years. However, many open problems exist for taking full advantage of such network-based loosely coupled architectures.
The hardware, system software, and applications aspects of NOWs are being investigated by numerous research groups.
3. Explicit message passing versus virtual shared memory. Which scheme is better, that of forcing the users to explicitly specify all
messages that must be sent between processors or to allow them to program in an abstract higher-level model, with the required
messages automatically generated by the system software? This question is essentially very similar to the one asked in the early
days of high-level languages and virtual memory. At some point in the past, programming in assembly languages and doing
explicit transfers between secondary and primary memories could lead to higher efficiency. However, nowadays, software is so
complex and compilers and operating systems so advanced (not to mention processing power so cheap) that it no longer makes
sense to hand-optimize the programs, except in limited time-critical instances. However, we are not yet at that point in parallel
processing, and hiding the explicit communication structure of a parallel machine from the programmer has nontrivial
consequences for performance.
THE PRAM SHARED-MEMORY MODEL
The theoretical model used for conventional or sequential computers (SISD class) is
known as the random-access machine (RAM) (not to be confused with random-access
memory, which has the same acronym). The parallel version of RAM, PRAM (pronounced "pea-ram"), constitutes an abstract
model of the class of global-memory parallel processors. The abstraction consists of ignoring the details of the
processor-to-memory interconnection network and taking the view that each processor can access any memory location in each
machine cycle, independent of what other processors are doing.
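The PRAM view, in which every processor can touch any cell of the global memory in each synchronous machine cycle, can be illustrated with a simulated parallel sum (the code is my own sketch, not a standard PRAM library):

```python
def pram_sum(values):
    """Simulate an EREW-PRAM-style parallel reduction: in each
    synchronous step, processor i adds cell i+stride into cell i.
    Summing n cells takes ceil(log2 n) such steps."""
    mem = list(values)            # the shared global memory
    n = len(mem)
    stride = 1
    steps = 0
    while stride < n:
        # All active processors act "simultaneously" within one
        # machine cycle; the Python loop merely serializes them.
        for i in range(0, n - stride, 2 * stride):
            mem[i] += mem[i + stride]
        stride *= 2
        steps += 1
    return mem[0], steps

total, steps = pram_sum(range(8))   # 8 cells are summed in 3 steps
```

The point of the model is visible in the step count: with enough processors, the sum of n numbers takes O(log n) cycles instead of the O(n) a single RAM processor needs.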
Distributed-memory architectures can be conceptually viewed as in Fig. 4.5. A collection of p processors, each with its own private memory,
communicates through an interconnection network. Here, the latency of the
interconnection network may be less critical, as each processor is likely to
access its own local memory most of the time. However, the communication
bandwidth of the network may or may not be critical, depending on the type of
parallel applications and the extent of task interdependencies. Note that each
processor is usually connected to the network through multiple links or channels
(this is the norm here, although it can also be the case for shared-memory
parallel processors).
Cache coherence
In computer architecture, cache coherence is the uniformity
of shared resource data that ends up stored in multiple local
caches. When clients in a system maintain caches of a
common memory resource, problems may arise with
incoherent data, which is particularly the case with CPUs in
a multiprocessing system.
In the illustration on the right, suppose both clients have
a cached copy of a particular memory block from a previous
read. If the client on the bottom updates/changes
that memory block, the client on the top could be left with an
invalid cache of memory without any notification of the
change. Cache coherence is intended to manage such
conflicts by maintaining a coherent view of the data values in
multiple caches.
The following are the requirements for cache coherence:[2]
Write Propagation
Changes to the data in any cache must be propagated to other copies (of that cache line) in the peer
caches.
Transaction Serialization
Reads/Writes to a single memory location must be seen by all processors in the same order.
Coherence protocols
Coherence Protocols apply cache coherence in multiprocessor systems. The intention is that two clients must
never see different values of the same shared data.
The protocol must implement the basic requirements for coherence. It can be tailor made for the target
system/application.
Protocols can also be classified as snooping (snoopy/broadcast) or directory-based. Typically, early systems
used directory-based protocols where a directory would keep track of the data being shared and the sharers. In
snoopy protocols, the transaction requests (read/write/upgrade) are sent out to all processors. All processors
snoop the request and respond appropriately.
Write Propagation in Snoopy protocols can be implemented by either of the following:
Write Invalidate
When a write operation is observed to a location that a cache has a copy of, the cache controller
invalidates its own copy of the snooped memory location, thus forcing reads from main memory of the
new value on their next access.[4]
Write Update
When a write operation is observed to a location that a cache has a copy of, the cache controller updates
its own copy of the snooped memory location with the new data.
If the protocol design states that whenever any copy of the shared data is changed, all the other copies
must be "updated" to reflect the change, then it is a write update protocol. If the design states that on a
write to a cached copy by any processor requires other processors to discard/invalidate their cached
copies, then it is a write invalidate protocol.
However, scalability is one shortcoming of broadcast protocols.
Various models and protocols have been devised for maintaining coherence.
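The write-invalidate and write-update policies described above can be sketched with a toy single-block cache (my own simplification: real protocols track per-line states such as MSI/MESI rather than a single valid bit):

```python
class Cache:
    """A toy private cache holding one shared memory block."""
    def __init__(self):
        self.value = None
        self.valid = False

def read(caches, i, memory):
    """Processor i reads the block, fetching from memory on a miss."""
    if not caches[i].valid:
        caches[i].value = memory[0]
        caches[i].valid = True
    return caches[i].value

def write_invalidate(caches, writer, value, memory):
    """Write-invalidate: the writer updates memory and its own
    copy; every peer copy is invalidated, so the peers' next
    reads miss and fetch the new value."""
    memory[0] = value
    caches[writer].value = value
    caches[writer].valid = True
    for i, c in enumerate(caches):
        if i != writer:
            c.valid = False

def write_update(caches, writer, value, memory):
    """Write-update: every peer copy is updated in place."""
    memory[0] = value
    for c in caches:
        c.value = value
        c.valid = True
```

Both policies keep the two clients from ever observing different values of the shared block, which is the coherence requirement; they differ only in whether peers are told "your copy is stale" or handed the new data outright.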
Data Parallel
In the data-parallel model, tasks are assigned to processes and each task performs similar
types of operations on different data. Data parallelism is a consequence of a single
operation being applied to multiple data items.
The primary characteristic of data-parallel problems is that the intensity of data
parallelism increases with the size of the problem, which in turn makes it possible to
use more processes to solve larger problems.
Task Graph Model
In the task graph model, parallelism is expressed by a task graph. A task graph can
be either trivial or nontrivial. In this model, the correlation among the tasks is utilized
to promote locality or to minimize interaction costs. This model is employed to solve
problems in which the quantity of data associated with the tasks is huge compared to
the amount of computation associated with them. The tasks are assigned so as to
reduce the cost of data movement among the tasks.
Examples: Parallel quicksort, sparse matrix factorization, and parallel algorithms
derived via the divide-and-conquer approach.
Here, problems are divided into atomic tasks and implemented as a graph. Each task is
an independent unit of job that has dependencies on one or more antecedent tasks. After
the completion of a task, the output of an antecedent task is passed to the dependent
task. A task starts execution only when all of its antecedent tasks are completed. The
final output of the graph is received when the last dependent task is completed (Task 6
in the above figure).
The tasks may be available in the beginning, or may be generated dynamically. If tasks
are generated dynamically and assigned in a decentralized manner, then a
termination detection algorithm is required so that all the processes can actually detect
the completion of the entire program and stop looking for more tasks.
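The execution rule of this model, in which a task starts only after all of its antecedent tasks have completed, can be sketched as follows (the function names are my own):

```python
from collections import deque

def run_task_graph(deps, work):
    """Execute a task graph: deps maps each task to its set of
    antecedent tasks; work maps each task to a zero-argument
    function. A task runs only once all antecedents are done."""
    pending = {t: set(d) for t, d in deps.items()}
    ready = deque(t for t, d in pending.items() if not d)
    order = []
    while ready:
        t = ready.popleft()
        work[t]()                 # run the task's job
        order.append(t)
        for u, d in pending.items():
            if t in d:            # t's output feeds task u
                d.remove(t)
                if not d:         # all antecedents of u done
                    ready.append(u)
    return order
```

This is a plain topological execution; a decentralized version would additionally need the termination detection mentioned above so that idle processes know when no more tasks will appear.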
Master-Slave Model
In the master-slave model, one or more master processes generate tasks and allocate
them to slave processes. The tasks may be allocated beforehand if
the master can estimate the volume of the tasks, or
In some cases, a task may need to be completed in phases, and the task in each phase
must be completed before the task in the next phase can be generated. The master-
slave model can be generalized to the hierarchical or multi-level master-slave
model, in which the top-level master feeds a large portion of the tasks to the second-
level masters, who further subdivide the tasks among their own slaves and may perform
a part of the task themselves.
The tasks should be selected in a way that the cost of performing a task dominates the
cost of communication and the cost of synchronization.
Asynchronous interaction may help overlap interaction and the computation associated
with work generation by the master.
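A minimal sketch of the master-slave model using Python threads follows (the structure and names are my own; a real implementation would add dynamic task generation and the load-balancing concerns discussed above):

```python
import queue
import threading

def master_slave(tasks, n_slaves, worker):
    """The master fills a task queue; each slave repeatedly
    takes one task, applies `worker`, and reports a result."""
    task_q = queue.Queue()
    result_q = queue.Queue()
    for t in tasks:               # master generates the tasks
        task_q.put(t)

    def slave():
        while True:
            try:
                t = task_q.get_nowait()
            except queue.Empty:
                return            # no tasks left: slave stops
            result_q.put(worker(t))

    threads = [threading.Thread(target=slave) for _ in range(n_slaves)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return sorted(result_q.queue)

results = master_slave(range(10), 4, lambda x: x * x)
```

The shared queue gives automatic load balancing: a slave that finishes a cheap task simply pulls the next one, so no slave sits idle while work remains.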
Pipeline Model
It is also known as the producer-consumer model. Here a set of data is passed on
through a series of processes, each of which performs some task on it. The arrival
of new data triggers the execution of a new task by a process in the queue. The
processes could form a queue in the shape of linear or multidimensional arrays, trees,
or general graphs with or without cycles.
This model is a chain of producers and consumers. Each process in the queue can be
considered as a consumer of a sequence of data items for the process preceding it in
the queue and as a producer of data for the process following it in the queue. The queue
does not need to be a linear chain; it can be a directed graph. The most common
interaction minimization technique applicable to this model is overlapping interaction
with computation.
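The producer-consumer chain can be sketched with Python generators, where each stage consumes items from its predecessor and produces items for its successor (a sketch only; the particular stages chosen are arbitrary):

```python
def producer(data):
    """The first process in the chain: emits the raw data items."""
    for item in data:
        yield item

def stage(items, f):
    """One process in the queue: a consumer of the preceding
    stage's items and a producer for the following stage."""
    for item in items:
        yield f(item)

# A three-stage linear pipeline: square, add one, then format.
# Each item flows through all three stages before the next
# item enters, mirroring the arrival-driven execution above.
pipe = stage(stage(stage(producer(range(4)),
                         lambda x: x * x),
                   lambda x: x + 1),
             str)
result = list(pipe)
```

Because generators are lazy, each stage does its work only when its successor demands an item, which is one simple way to overlap the interaction between stages with computation.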
Hybrid Models
A hybrid algorithm model is required when more than one model may be needed to
solve a problem.
In hardware
In computer hardware, shared memory refers to a (typically large) block of random access memory (RAM) that
can be accessed by several different central processing units (CPUs) in a multiprocessor computer system.
Shared memory systems may use:[1]
uniform memory access (UMA): all the processors share the physical memory uniformly;
non-uniform memory access (NUMA): memory access time depends on the memory location relative to a
processor;
cache-only memory architecture (COMA): the local memories for the processors at each node are used as
cache instead of as actual main memory.
In software
In computer software, shared memory is either
a method of inter-process communication (IPC), i.e. a way of exchanging data between programs running at
the same time. One process will create an area in RAM which other processes can access;
a method of conserving memory space by directing accesses to what would ordinarily be copies of a piece of
data to a single instance instead, by using virtual memory mappings or with explicit support of the program in
question. This is most often used for shared libraries and for XIP.
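The IPC mechanism described in the first bullet can be sketched with Python's standard multiprocessing.shared_memory module; for illustration both handles are opened in a single process here, but the attach-by-name step is exactly what a second process would do:

```python
from multiprocessing import shared_memory

# One process creates a named block of RAM...
shm = shared_memory.SharedMemory(create=True, size=16)
shm.buf[:5] = b"hello"

# ...and any other process can attach to the same block by its
# name and see the data without any copying.
peer = shared_memory.SharedMemory(name=shm.name)
data = bytes(peer.buf[:5])

peer.close()
shm.close()
shm.unlink()   # release the block once every process is done
```

Exactly one process should call unlink(); every attached process calls close() on its own handle when finished.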