The input data travels from the input unit to the ALU. Similarly, the computed data travels from the ALU to the output unit. Data constantly moves between the storage unit and the ALU, because stored data is operated on before being stored again. The control unit directs all the other units as well as the flow of their data.
• Input Unit
• Output Unit
• Storage Unit
• Control Unit
This unit controls all the other units of the computer system and so is known as its central nervous system. It transfers data throughout the computer as required, including from the storage unit to the central processing unit and vice versa. The control unit also dictates how the memory, input/output devices, arithmetic logic unit, etc. should behave.
• Objective:
• Determine the frequent case.
• Determine how much improvement in performance is possible by
making it faster.
• Amdahl's Law :
o The performance improvement to be gained from
using some faster mode of execution is limited by
the fraction of the time the faster mode can be
used.
o Two factors:
• Fraction enhanced : Fraction of compute time in original machine that
can be converted to take advantage of the enhancement.
▪ Always <= 1.
• Speedup enhanced : Improvement gained by the enhanced execution mode:
Speedup enhanced = (execution time in original mode) / (execution time in enhanced mode)
▪ Always >= 1.
o Speedup overall using Amdahl's Law:
Speedup overall = 1 / ((1 - Fraction enhanced) + (Fraction enhanced / Speedup enhanced))
o Is this a worthy investment? (Will we actually get a proportionate increase in overall performance?)
o Note that CPU time = Instruction Count * CPI * Clock Cycle Time, and CPU time is equally dependent on these three factors: a 10% improvement in any one of them leads to a 10% improvement in CPU time.
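To make the formula concrete, here is a minimal C++ sketch of Amdahl's Law. The 40% fraction and the 10x enhancement below are illustrative values, not figures from these notes.

```cpp
// A minimal sketch of Amdahl's Law with illustrative numbers.
#include <iostream>

// Speedup overall = 1 / ((1 - F) + F / S)
double overall_speedup(double fraction_enhanced, double speedup_enhanced) {
    return 1.0 / ((1.0 - fraction_enhanced) +
                  fraction_enhanced / speedup_enhanced);
}

int main() {
    // Suppose 40% of the execution time can be made 10x faster.
    std::cout << overall_speedup(0.4, 10.0) << "\n";  // ~1.56, far below 10x
}
```

Even though the enhanced mode is 10 times faster, the overall speedup is only about 1.56, because the unenhanced 60% of the time limits the gain.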
A computer user who runs the same programs day in and day out would be the perfect candidate to evaluate a new computer. To evaluate a new system, the user would simply compare the execution time of her workload: the mixture of programs and operating system commands that she runs on the machine. There are five levels of programs used in such circumstances, listed below in decreasing order of accuracy of prediction.
5. Synthetic benchmarks—Similar in philosophy to kernels, synthetic
benchmarks try to match the average frequency of operations and
operands of a large set of programs. Whetstone and Dhrystone are the
most popular synthetic benchmarks.
Benchmark Suites
Desktop Benchmarks
1. Pro/Engineer: a solid modeling application that does extensive 3-D
rendering. The input script is a model of a photocopying machine
consisting of 370,000 triangles.
Server Benchmarks
Embedded Benchmarks
Basic Pipelining:-
A pipeline system is like a modern-day assembly line in a factory. For example, in a car manufacturing plant, huge assembly lines are set up with robotic arms performing a certain task at each point, after which the car moves on to the next arm.
Types of Pipeline
1. Arithmetic Pipeline
2. Instruction Pipeline
Arithmetic Pipeline
X = A*2^a
Y = B*2^b
Here A and B are mantissas (the significant digits of the floating-point numbers), while a and b are exponents.
Registers are used for storing the intermediate results between the
above operations.
Instruction Pipeline
Pipeline Conflicts
There are some factors that cause the pipeline to deviate from its normal performance. Some of these factors are given below:
3. Branching
In order to fetch and execute the next instruction, we must know what that instruction is. If the present instruction is a conditional branch whose result determines the next instruction, then the next instruction may not be known until the current one is processed.
4. Interrupts
5. Data Dependency
Advantages of Pipelining
Disadvantages of Pipelining
First, the two exponents are compared and the larger of the two is chosen as the result exponent. The difference between the exponents then decides how many positions the mantissa of the number with the smaller exponent must be shifted to the right. After this shift, both mantissas are aligned. Finally, the addition of the two numbers takes place, followed by normalisation of the result in the last segment.
Example:
Let us consider two numbers, X = 0.3214*10^3 and Y = 0.4500*10^2.
Explanation:
First, the two exponents are subtracted to give 3 - 2 = 1. Thus 3 becomes the exponent of the result, and the mantissa of the number with the smaller exponent is shifted 1 position to the right to give
Y = 0.0450*10^3
Adding the aligned mantissas then gives the result
Z = 0.3664*10^3
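The four segments above can be traced with a minimal C++ sketch. Representing numbers as (mantissa, exponent) pairs and normalising in decimal are simplifications for illustration.

```cpp
// A minimal sketch of the four arithmetic-pipeline segments, applied
// to the X and Y values from the example above. Numbers are held as
// (mantissa, exponent) pairs with the mantissa scaled as 0.xxxx.
#include <iostream>
#include <utility>

std::pair<double, int> fp_add(std::pair<double, int> x,
                              std::pair<double, int> y) {
    if (x.second < y.second) std::swap(x, y);   // Segment 1: compare exponents
    for (int d = x.second - y.second; d > 0; --d)
        y.first /= 10.0;                        // Segment 2: align mantissas
    double m = x.first + y.first;               // Segment 3: add mantissas
    int e = x.second;
    while (m >= 1.0) { m /= 10.0; ++e; }        // Segment 4: normalise result
    return {m, e};
}

int main() {
    auto [m, e] = fp_add({0.3214, 3}, {0.4500, 2});
    std::cout << m << " * 10^" << e << "\n";    // 0.3664 * 10^3
}
```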
Let us see an example of an instruction pipeline.
Example:
I1. R4 <- R1 + R5
I2. R5 <- R1 + R2
When there is a chance that I2 will end before I1 (i.e., when there is concurrent execution), it must be ensured that the result written to register R5 is not saved before I1 has had a chance to obtain its operands.
I1. R2 <- R4 + R7
I2. R2 <- R1 + R3
Here both I1 and I2 write to R2, so the final value of R2 depends on which instruction completes last.
There are various methods to handle the data hazard in computer
architecture that occur in the program. Some of the methods to
handle the data hazard are as follows:
1. Forwarding :
It adds special circuitry to the pipeline. This method works
because it takes less time for the required values to travel
through a wire than it does for a pipeline segment to compute
its result.
2. Code reordering :
We need a special type of software to reorder code. We call
this type of software a hardware-dependent compiler.
3. Stall Insertion :
It inserts one or more stalls (no-op instructions) into the pipeline, which delays the execution of the current instruction until the required operand is written to the register file. However, this method decreases pipeline efficiency and throughput (see the sketch below).
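The trade-off between stalling and forwarding can be illustrated with a minimal C++ sketch. The toy model below, with its two-cycle stall cost and three-register instruction format, is an assumption of this illustration, not a detail from the notes.

```cpp
// A minimal sketch of stall insertion for a RAW hazard between adjacent
// instructions in a toy 5-stage pipeline model.
#include <iostream>
#include <string>
#include <vector>

struct Instr { std::string dest, src1, src2; };

int count_stalls(const std::vector<Instr>& prog, bool forwarding) {
    int stalls = 0;
    for (std::size_t i = 1; i < prog.size(); ++i) {
        const auto& prev = prog[i - 1];
        const auto& cur  = prog[i];
        // RAW dependence: the current instruction reads the previous result.
        if (cur.src1 == prev.dest || cur.src2 == prev.dest)
            stalls += forwarding ? 0 : 2;   // assumed two-cycle stall cost
    }
    return stalls;
}

int main() {
    std::vector<Instr> prog = {{"R2", "R4", "R7"},   // I1. R2 <- R4 + R7
                               {"R5", "R2", "R3"}};  // I2. R5 <- R2 + R3
    std::cout << count_stalls(prog, false) << "\n";  // 2 stalls without forwarding
    std::cout << count_stalls(prog, true)  << "\n";  // 0 stalls with forwarding
}
```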
Control Hazards
Control hazards arise from instructions that perform condition testing (these correlate with while, for, if, and case statements). Such instructions are converted into one of the BRANCH instruction variations. To determine the program flow, you must know the value of the condition being tested, which makes this a difficult situation.
As a result, a control hazard develops when the decision to execute one instruction depends on the result of another instruction, such as a conditional branch that examines the condition's resulting value.
The Program Counter (PC) is loaded with the appropriate address for branch and jump instructions, which determines the program flow. The PC holds the address of the next instruction to be fetched and executed by the CPU. Take a look at the instructions that follow:
• Example:
ADD R2,R3,R4
BEQZ R2,L1
LD R1,0(R2)
L1:
What will happen if we move LD before BEQZ? This may lead to a memory protection violation. The branch instruction is a guarding branch that checks for an address of zero and jumps to L1; if the load is moved ahead of it, an additional exception may be raised. Data flow is the actual flow of data values among the instructions that produce results and those that consume them. Branches make the flow dynamic and determine which instruction is the supplier of data.
• Example:
DADDU R1,R2,R3
BEQZ R4,L
DSUBU R1,R5,R6
L:…
OR R7,R1,R8
If we do a more aggressive implementation by adding hardware to resolve the branch in the ID stage, the penalty can be reduced. The operand values needed to resolve the branch can come from either the ALU/MEM or MEM/WB pipeline latches.
• Because the values in a branch comparison are needed during ID
but may be produced later in time, it is possible that a data
hazard can occur and a stall will be needed. For example, if an
ALU instruction immediately preceding a branch produces one of
the operands for the comparison in the branch, a stall will be
required, since the EX stage for the ALU instruction will occur
after the ID cycle of the branch.
The first option, stalling the pipeline until the branch is resolved or fetching again from the resolved address, leads to too much penalty. Branches are very frequent, and not handling them effectively brings down performance; we would also be violating the principle of “Make the common case fast”.
In the predict-not-taken approach, we treat every branch as “not taken”. Remember that the registers are read during ID, and we also perform an equality test to decide whether to branch or not. We simply load the next instruction (PC+4) and continue. The complexity arises when the branch evaluates to true and we end up needing to actually take the branch. In such a case, the pipeline is cleared of any code loaded from the “not-taken” path, and execution continues from the branch target.
These structural hazards cause stalls in the pipeline. To minimize or eliminate the stalls due to structural dependency in the pipeline, we use a hardware mechanism called the renaming technique.
• Structural Hazards
• Data Hazards
• Control Hazards
In the above execution sequence, instructions I1 and I4 are both trying to access the same resource, Mem (memory), in the same clock cycle CC4. This situation in the pipeline is called a structural hazard.
Using the renaming technique to overcome the structural hazard, we achieve a pipeline CPI = 1 with zero stalls.
When an exception is not handled, the remaining code cannot execute, and the program crashes. Exception handling helps ensure this does not happen when an exception occurs.
The try block detects and throws any found exceptions to the catch blocks, which then handle them.
Error handling code can also be separated from normal code with the use of try blocks: code that could cause an exception is enclosed in curly braces or brackets. Try blocks can help programmers to categorize exception objects.
The try bracket contains the code that may encounter the exception, and handling the exception prevents the application from crashing.
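Here is a minimal C++ sketch of this try/catch structure; the divide function and its error message are hypothetical examples.

```cpp
// A minimal sketch of try/catch: the try block encloses code that may
// throw, and the catch block handles the exception so the program does
// not crash.
#include <iostream>
#include <stdexcept>

int divide(int a, int b) {
    if (b == 0) throw std::runtime_error("division by zero");
    return a / b;
}

int main() {
    try {                                        // code that could throw
        std::cout << divide(10, 0) << "\n";
    } catch (const std::runtime_error& e) {      // handler for the thrown object
        std::cout << "handled: " << e.what() << "\n";
    }
}
```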
Whenever an exception or interrupt occurs, the hardware starts
executing the code that performs an action in response to the
exception. This action may involve killing a process, outputting an
error message, communicating with an external device, or horribly
crashing the entire computer system by initiating a “Blue Screen of
Death” and halting the CPU. The instructions responsible for this
action reside in the operating system kernel, and the code that
performs this action is called the interrupt handler code. We can think of the handler code as an operating system subroutine. After the handler code is executed, it may be possible to continue execution after the instruction where the exception or interrupt occurred.
Exception and Interrupt Handling :
Whenever an exception or interrupt occurs, execution transitions from user mode to kernel mode, where the exception or interrupt is handled. In detail, the following steps must be taken to handle an exception or interrupt.
While entering the kernel, the context (values of all CPU registers) of
the currently executing process must first be saved to memory. The
kernel is now ready to handle the exception/interrupt.
At any point in time, the values of all the registers in the CPU define the context of the CPU. Another name used for CPU context is CPU state.
The exception/interrupt handler uses the same CPU as the currently
executing process. When entering the exception/interrupt handler, the
values in all CPU registers to be used by the exception/interrupt
handler must be saved to memory. The saved register values can later be restored before resuming execution of the process.
The handler may have been invoked for a number of reasons. The
handler thus needs to determine the cause of the exception or
interrupt. Information about what caused the exception or interrupt
can be stored in dedicated registers or at predefined addresses in
memory.
Next, the exception or interrupt needs to be serviced. For instance, if
it was a keyboard interrupt, then the key code of the key press is
obtained and stored somewhere or some other appropriate action is
taken. If it was an arithmetic overflow exception, an error message
may be printed or the program may be terminated.
The exception/interrupt has now been handled by the kernel. The kernel may choose to resume the same process that was executing prior to handling the exception/interrupt, or to resume execution of any other process currently in memory.
The context of the CPU can now be restored for the chosen process by
reading and restoring all register values from memory.
The process selected to be resumed must be resumed at the same point at which it was stopped. The address of this instruction was saved by the machine when the interrupt occurred, so it is simply a matter of getting this address and making the CPU continue executing at this address.
Pipeline Optimization:-
Technique 1: To maximize the rendering speed, allow stages that are not bottlenecks to consume as much time as the bottleneck.
Technique 2:
◦ Make the other two stages work less or (better) not at all.
◦ If performance stays the same, then the stage not included above is the bottleneck.
Complication: the bus between the CPU and the graphics card may be the bottleneck (it is not a typical stage).
This implies that all information items are originally stored in level Mn. During processing, subsets of Mn are copied into Mn-1. Similarly, subsets of Mn-1 are copied into Mn-2, and so on. Multi-level caches can be designed in various ways depending on whether the content of one cache is present in the other levels of caches. If all blocks in the higher level cache are also present in the lower level cache, then the lower level cache is said to be inclusive of the higher level cache.
Coherence Property
Locality of references
• Spatial Locality - This refers to the tendency of a process to access items whose addresses lie near one another; for example, operations on tables or arrays involve accesses to a clustered area of the address space.
• Sequential Locality - In typical programs the execution of
instructions follows a sequential order unless branch
instructions create out of order execution. The ratio of in order
execution to out of order execution is roughly 5:1 in ordinary
programs.
1. Capacity
It refers to the total volume of data that a system’s memory can store.
The capacity increases moving from the top to the bottom in the
Memory Hierarchy.
2. Access Time
It refers to the time interval between the request for a read/write and the availability of the data. The access time increases as we move from the top to the bottom in the Memory Hierarchy.
3. Performance
When computer systems were designed without a Memory Hierarchy, the speed gap between the CPU registers and the Main Memory grew because of the large difference in the system's access times, and this resulted in lower system performance, so an enhancement was required. That enhancement was introduced in the form of Memory Hierarchy Design, and because of it, the system's performance increased. One of the primary ways to increase the performance of a system is to minimise how far down the memory hierarchy one has to go to manipulate data.
1. Registers
A register is usually SRAM (static RAM) in the computer processor that holds a data word, typically 64 or 128 bits. A majority of processors use a status word register and an accumulator. The accumulator is primarily used to store the results of mathematical operations, and the status word register is primarily used for decision making.
2. Cache Memory
The cache basically holds a chunk of information that is used frequently from the main memory. We can also find cache memory in the processor. If a processor has a single core, it will rarely have multiple cache levels. Present multi-core processors typically have three levels: two levels for every individual core, and one level that is shared among the cores.
3. Main Memory
In a computer, the main memory is the memory unit with which the CPU communicates directly. It is the primary storage unit of a computer system: a fast and fairly large memory used for storing data throughout the computer's operations. This type of memory is made up of both ROM and RAM.
4. Magnetic Disks
In a computer, magnetic disks are circular plates fabricated from plastic or metal and coated with a magnetised material. Both faces of a disk are frequently used, and many disks can be stacked on a single spindle, with read/write heads available for every surface. All the disks rotate together at high speed.
5. Magnetic Tape
Magnetic tape is a normal magnetic recording medium consisting of a slender magnetizable overlay that covers an extended, thin strip of plastic film. It is used mainly to back up huge chunks of data. When a computer needs to access a tape, it first mounts it to access the information; once the information has been accessed, the tape is unmounted. Access time is slow for magnetic tape, and it can take a few minutes to access the data on a tape.
Cache Memory Organization:-
Cache memory is a small, fast memory placed between the CPU and main memory that holds frequently used data.
Levels of memory:
• Level 3 or Main Memory – It is the memory on which the computer currently works. It is small in size compared with secondary memory, and once the power is off, data no longer stays in this memory.
• Level 4 or Secondary Memory – It is external memory, which is not as fast as main memory, but data stays permanently in this memory.
We can improve cache performance by using a larger cache block size and higher associativity, and by reducing the miss rate, the miss penalty, and the time to hit in the cache.
Cache Mapping: There are three different types of mapping used for the purpose of cache memory, which are as follows: direct mapping, associative mapping, and set-associative mapping. These are explained below.
A. Direct Mapping
The simplest technique, known as direct mapping, maps each block of main memory into only one possible cache line. In direct mapping, each memory block is assigned to a specific line in the cache. If a line is already occupied by a memory block when a new block needs to be loaded, the old block is trashed. An address is split into two parts, an index field and a tag field: the tag is stored in the cache line, while the index selects which line to use. Direct mapping's performance is directly proportional to the hit ratio.
i = j modulo m
where
i = cache line number
j = main memory block number
m = number of lines in the cache
B. Associative Mapping
In this type of mapping, an associative memory is used to store the content and addresses of the memory words. Any block can go into any line of the cache. This means that the word-id bits are used to identify which word in the block is needed, while the tag becomes all of the remaining bits. This enables the placement of any word at any place in the cache memory. It is considered to be the fastest and most flexible mapping form. In associative mapping, the number of index bits is zero.
C. Set-associative Mapping
This form of mapping is an enhanced form of direct mapping in which the drawbacks of direct mapping are removed. Set-associative mapping addresses the problem of possible thrashing in the direct mapping method. It does this by grouping a few lines together into a set, instead of having exactly one line that a block can map to in the cache. A block in memory can then map to any one of the lines of a specific set. Set-associative mapping allows each word that is present in the cache to correspond to two or more words in main memory for the same index address. It combines the best of direct and associative cache mapping techniques. In set-associative mapping, the index bits are given by the set offset bits. In this case, the cache consists of a number of sets, each of which consists of a number of lines. The relationships are
m = v * k
i = j mod v
where
i = cache set number
j = main memory block number
m = number of lines in the cache
v = number of sets
k = number of lines in each set
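Both index formulas can be demonstrated with a minimal C++ sketch; the cache geometry below (8 lines, 2-way sets) and the block number are assumed example values.

```cpp
// A minimal sketch of the mapping formulas: i = j mod m for direct
// mapping and i = j mod v for set-associative mapping.
#include <iostream>

int main() {
    int m = 8;           // number of cache lines
    int k = 2;           // lines per set (2-way set associative)
    int v = m / k;       // number of sets, from m = v * k
    int j = 13;          // main memory block number

    std::cout << "direct-mapped line:  " << j % m << "\n";  // 13 mod 8 = 5
    std::cout << "set-associative set: " << j % v << "\n";  // 13 mod 4 = 1
}
```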
2. The correspondence between the main memory blocks and
those in the cache is specified by a mapping function.
3. Primary Cache – A primary cache is always located on the
processor chip. This cache is small and its access time is
comparable to that of processor registers.
4. Secondary Cache – Secondary cache is placed between the
primary cache and the rest of the memory. It is referred to as
the level 2 (L2) cache. Often, the Level 2 cache is also housed
on the processor chip.
5. Spatial Locality of reference – This says that there is a good chance that the required element will be present in close proximity to the reference point, and the next reference is likely to lie in even closer proximity to that point.
6. Temporal Locality of reference – Here the Least Recently Used algorithm is used. Whenever a page fault occurs for a word, we do not load only that word into main memory; the complete page is loaded, because the locality of reference rule says that if you refer to any word, the next word is likely to be referred to next. That is why the complete block is loaded.
Every time your cache is purged, the data in it needs to be written into
the memory after the first request. This is why, at Kinsta, we use
the Kinsta MU plugin so that only certain sections of the cache are
purged.
The more you purge your cache, the more likely cache misses are to
occur. Of course, sometimes clearing your cache is necessary.
However, one way you can prevent this problem is to expand the
lifespan of your cache by increasing its expiry time. Keep in mind that
the expiry time should coincide with how often you update your
website to ensure that the changes appear to your users.
For example, if you don’t frequently update your site, you can
probably set the expiry time to two weeks. Alternatively, if site
updates are a weekly occurrence, your expiry time shouldn’t exceed a
day or two.
Your options for doing this will vary depending on your hosting
provider. If you rely on caching plugins, you can use the WP Rocket
plugin. Once installed and activated, you can navigate
to Settings > WP Rocket, followed by the Cache tab.
Under the Cache Lifespan section, you will be able to specify the
global expiry time for when the cache is cleared. When you’re done,
you can click on the Save Changes button at the bottom of the page.
However, increasing your RAM can be a bit pricey. You may want to
check with your hosting provider to see what your options are. For
example, at Kinsta we offer scalable hosting. This means that you can
easily scale up your plan without having to worry about downtime.
1. First In First Out (FIFO): This policy means that the data that was
added the earliest to the cache will be the first to be evicted.
2. Last In First Out (LIFO): This means that the data entries added
last to the cache will be the first to be removed.
3. Least Recently Used (LRU): True to its name, this policy first
evicts the data accessed the longest time ago.
4. Most Recently Used (MRU): With this policy, the data most
recently accessed is evicted first.
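The LRU policy from the list above can be sketched in a few lines of C++; the capacity and access sequence are illustrative.

```cpp
// A minimal sketch of LRU eviction: on each access, move the entry to
// the front; when the cache is full, evict the entry at the back (the
// one accessed longest ago).
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <list>

void access(std::list<int>& cache, int key, std::size_t capacity) {
    auto it = std::find(cache.begin(), cache.end(), key);
    if (it != cache.end())
        cache.erase(it);          // hit: remove so we can re-insert at front
    else if (cache.size() == capacity)
        cache.pop_back();         // miss on a full cache: evict the LRU entry
    cache.push_front(key);        // this key is now the most recently used
}

int main() {
    std::list<int> cache;
    for (int key : {1, 2, 3, 1, 4})   // capacity 3: accessing 4 evicts 2
        access(cache, key, 3);
    for (int key : cache) std::cout << key << " ";  // prints: 4 1 3
    std::cout << "\n";
}
```

FIFO differs only in that a hit does not move the entry to the front.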
6. Lowering the cost of computer systems, as you find the right amount of main memory and virtual memory.
3. Moving data between a computer's virtual and physical memory requires more from the computer's hardware.
5. If a computer only has a small amount of RAM, virtual memory can cause thrashing, which is when the computer must constantly swap data between virtual and physical memory, resulting in significant performance delays.
6. It can take longer for applications to load, or for a computer to switch between applications, when using virtual memory.
Mapping techniques
The process of transferring data from main memory to cache memory is called mapping. In cache memory, three kinds of mapping techniques are used:
1. Associative mapping
2. Direct mapping
3. Set Associative mapping
1. Valid bit: This gives the status of the data block. If 0 then the
data block is not referenced and if 1 then the data block is
referenced.
2. Tag: This is the main memory address part.
3. Data: This is the data block.
1) Associative mapping
• The associative cache controller interprets the CPU-generated request as a tag and a word offset.
• The existing tags in the cache controller are compared with the CPU-generated tag.
• If any one of the tags matches, the operation is a hit, and based on the word offset, the respective data is transferred to the CPU.
• If none of the tags match, the operation is a miss, and the reference is forwarded to the main memory.
• According to the main memory address format, the respective main memory block is enabled and then transferred to the cache memory using associative mapping. Later the data is transferred to the CPU.
• In this mapping technique, replacement algorithms are used to
replace the cache block when the cache is full.
• Tag memory size = number of lines * number of tag bits in the
line.
2) Direct mapping
K mod N = i
Where,
• K is the main memory block number.
• N is the number of cache lines.
• And, i is the cache memory line number.
3) Set Associative Mapping
K mod S = i
Where,
• K is the main memory block number.
• S is the number of sets in the cache.
• And, i is the cache set number.
• The main memory block is transferred to the cache memory by using a set-associative mapping function. Later the data is transferred to the CPU.
• In this technique, replacement algorithms are used to replace a block within a cache line when the set is full.
• Tag memory size = number of sets in cache * number of blocks in the set * number of tag bits.
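As a quick check of the two tag-memory-size formulas, here is a minimal C++ sketch; the line counts, set counts, and tag widths are assumed example values.

```cpp
// A minimal sketch of the tag-memory-size formulas with illustrative
// cache geometries.
#include <iostream>

int main() {
    int lines = 128, tag_bits = 11;            // direct/associative case
    std::cout << "tag memory = lines * tag bits = "
              << lines * tag_bits << " bits\n";

    int sets = 64, blocks_per_set = 2, set_tag_bits = 10;  // set-associative
    std::cout << "tag memory = sets * blocks * tag bits = "
              << sets * blocks_per_set * set_tag_bits << " bits\n";
}
```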
The page that has been in main memory the longest is replaced first in FIFO. The LRU algorithm can be implemented by associating a counter with each page that is in main memory.
When a page is referenced, its associated counter is set to zero. At fixed intervals of time, the counters of all pages currently in memory are incremented by 1.
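This counter-based scheme can be sketched in C++ as follows; the page numbers and the placement of the periodic ticks are illustrative.

```cpp
// A minimal sketch of counter-based LRU: a reference clears a page's
// counter, a periodic tick increments every counter, and the victim is
// the page with the largest counter (least recently used).
#include <iostream>
#include <map>

std::map<int, int> counters;   // page number -> age counter

void reference(int page) { counters[page] = 0; }

void tick() {                  // called at fixed time intervals
    for (auto& entry : counters) ++entry.second;
}

int victim() {                 // page unused for the longest time
    int worst_page = -1, worst_age = -1;
    for (const auto& [page, age] : counters)
        if (age > worst_age) { worst_page = page; worst_age = age; }
    return worst_page;
}

int main() {
    reference(1); tick();
    reference(2); tick();
    reference(3); tick();
    std::cout << "evict page " << victim() << "\n";  // page 1 is oldest
}
```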
Now, the decision of when to execute an operation depends largely on the compiler rather than the hardware. However, the extent of the compiler's control depends on the type of ILP architecture, which determines how much information regarding parallelism the compiler conveys to the hardware via the program. ILP architectures can be classified in the following ways –
1. Sequential Architecture :
Here, the program is not expected to explicitly convey any information regarding parallelism to the hardware, as in superscalar architecture.
2. Dependence Architectures :
Here, the program explicitly mentions information regarding the dependencies between operations, as in dataflow architecture.
3. Independence Architecture :
Here, the program gives information regarding which operations are independent of each other, so that they can be executed instead of the ‘nop’s.
In order to apply ILP, the compiler and hardware must determine data dependencies, identify independent operations, schedule these independent operations, assign functional units, and assign registers to store data.
Techniques for increasing instruction level parallelism:-
Instruction Level Parallelism (ILP) is the number of instructions in a program that can be executed simultaneously in a clock cycle. Microprocessors exploit ILP by means of several techniques that have been implemented over the last decades, in step with advances in hardware; this survey presents the different techniques that have been used successfully to execute multiple instructions of a single program in a single clock cycle.
Superscalar Architecture:-
A more aggressive approach is to equip the processor with multiple
processing units to handle several instructions in parallel in each
processing stage. With this arrangement, several instructions start execution in the same clock cycle, and the processor is said to use multiple issue. Such processors are capable of achieving an
instruction execution throughput of more than one instruction per
cycle. They are known as ‘Superscalar Processors’.
In the above diagram, there is a processor with two execution units: one for integer and one for floating-point operations. The instruction fetch unit is capable of reading two instructions at a time and storing them in the instruction queue. In each cycle, the dispatch unit retrieves and decodes up to two instructions from the front of the queue. If there is one integer instruction, one floating-point instruction, and no hazards, both instructions are dispatched in the same clock cycle, as sketched below.
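A minimal C++ sketch of this dual-issue behaviour follows; the queue model and the instruction mixes are assumptions of the illustration, not details of any particular processor.

```cpp
// A minimal sketch of dual issue: two instructions leave the queue in
// one cycle only when one targets the integer unit and the other the
// floating-point unit.
#include <deque>
#include <iostream>

enum class Unit { Integer, Float };

int cycles_to_drain(std::deque<Unit> queue) {
    int cycles = 0;
    while (!queue.empty()) {
        ++cycles;
        Unit first = queue.front(); queue.pop_front();
        // Dual issue only when the next instruction uses the other unit.
        if (!queue.empty() && queue.front() != first)
            queue.pop_front();
    }
    return cycles;
}

int main() {
    // Interleaved int/float instructions dual-issue every cycle...
    std::cout << cycles_to_drain({Unit::Integer, Unit::Float,
                                  Unit::Integer, Unit::Float}) << "\n";  // 2
    // ...while a run of integer instructions issues one per cycle.
    std::cout << cycles_to_drain({Unit::Integer, Unit::Integer,
                                  Unit::Integer, Unit::Integer}) << "\n"; // 4
}
```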
Advantages of Superscalar Architecture :
• The compiler can avoid many hazards through judicious
selection and ordering of instructions.
• The compiler should strive to interleave floating point and
integer instructions. This would enable the dispatch unit to
keep both the integer and floating point units busy most of
the time.
• In general, high performance is achieved if the compiler is
able to arrange program instructions to take maximum
advantage of the available hardware units.
Disadvantages of Superscalar Architecture :
• In a Superscalar Processor, the detrimental effect on
performance of various hazards becomes even more
pronounced.
• Due to this type of architecture, problems in scheduling can occur.
Superpipelining:-
VLIW Processors :-
Array processor:-
An attached array processor has two interfaces:
1. Input output interface to a common processor.
2. Interface with a local memory.
The processing units are synchronized to perform the same operation under the control of a common control unit, thus providing a single instruction stream, multiple data stream (SIMD) organization. As shown in the figure, SIMD contains a set of identical processing elements (PEs), each having a local memory M.
Each PE includes –
• ALU
• Floating point arithmetic unit
• Working registers
The master control unit controls the operation of the PEs. Its function is to decode each instruction and determine how the instruction is to be executed. If the instruction is a scalar or program control instruction, it is executed directly within the master control unit.
Main memory is used for storage of the program while each PE uses
operands stored in its local memory.
Vector Processor:-
A vector processor is basically a central processing unit that can execute a complete vector input with a single instruction. More specifically, it is a complete unit of hardware resources that executes a sequential set of similar data items in memory using a single instruction.
The elements of a vector are ordered so as to occupy successive memory addresses; this is why we say it processes the data sequentially.
It holds a single control unit but has multiple execution units that
perform the same operation on different data elements of the vector.
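The contrast with scalar execution can be shown with a minimal C++ sketch, where one conceptual "vector add" operates on all elements; the loop stands in for the hardware's lock-step execution units.

```cpp
// A minimal sketch of the vector idea: one "vector add" instruction
// produces C = A + B over the whole register length, instead of issuing
// one scalar add per element.
#include <iostream>
#include <vector>

std::vector<int> vadd(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i] = a[i] + b[i];   // hardware would perform these in lock-step
    return c;
}

int main() {
    std::vector<int> a = {1, 2, 3, 4}, b = {10, 20, 30, 40};
    for (int x : vadd(a, b)) std::cout << x << " ";  // 11 22 33 44
    std::cout << "\n";
}
```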
Multiprocessor architecture:-
A multiprocessor is a computer system in which two or more central processing units (CPUs) share full access to a common RAM. The main objective of using a multiprocessor is to boost the system's execution speed; other objectives are fault tolerance and application matching.
There are two types of multiprocessors: shared memory multiprocessors and distributed memory multiprocessors. In a shared memory multiprocessor, all CPUs share the common memory, while in a distributed memory multiprocessor, every CPU has its own private memory.
Applications of Multiprocessor –
1. As a uniprocessor, such as single instruction, single data
stream (SISD).
2. As a multiprocessor, such as single instruction, multiple data
stream (SIMD), which is usually used for vector processing.
3. Multiple series of instructions in a single perspective, such as
multiple instruction, single data stream (MISD), which is used
for describing hyper-threading or pipelined processors.
4. Inside a single system for executing multiple, individual
series of instructions in multiple perspectives, such as
multiple instruction, multiple data stream (MIMD).
Benefits of using a Multiprocessor –
• Enhanced performance.
• Multiple applications.
• Multi-tasking inside an application.
• High throughput and responsiveness.
• Hardware sharing among CPUs.
There are multiple processors in a multiprocessor system that share peripherals, memory, etc. So it is much more complicated to schedule processes and allocate resources to processes than in single processor systems. Hence, a more complex and sophisticated operating system is required in multiprocessor systems.
Large Main Memory Required
All the processors in the multiprocessor system share the memory. So
a much larger pool of memory is required as compared to single
processor systems.
Taxonomy of parallel architecture (Flynn’s taxonomy):-
Parallel computing is computing in which jobs are broken into discrete parts that can be executed concurrently. Each part is further broken down into a series of instructions, and instructions from each part execute simultaneously on different CPUs. Parallel systems deal with the simultaneous use of multiple computer resources, which can include a single computer with multiple processors, a number of computers connected by a network to form a parallel processing cluster, or a combination of both.
Parallel systems are more difficult to program than computers with a
single processor because the architecture of parallel computers
varies accordingly and the processes of multiple CPUs must be
coordinated and synchronized.
The crux of parallel processing is the CPU. Based on the number of instruction and data streams that can be processed simultaneously, computing systems are classified into four major categories:
Flynn’s classification –
1. Single-instruction, single-data (SISD) systems –
An SISD system is a uniprocessor machine that executes a single instruction operating on a single data stream. Instructions are processed sequentially, and the speed of the processing element is limited by the rate at which the computer can transfer information internally. Dominant representative SISD systems are the IBM PC and workstations.
2. Single-instruction, multiple-data (SIMD) systems –
An SIMD system is a multiprocessor machine capable of executing the same instruction on all the CPUs but operating on different data streams. Machines based on the SIMD model are well suited to scientific computing, since it involves lots of vector and matrix operations. The data elements of a vector can be divided into multiple sets (N sets for an N-PE system) so that the information can be passed to all the processing elements, and each PE can process one data set, as the sketch below illustrates.
Example: Z = sin(x)+cos(x)+tan(x)
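A minimal C++ sketch of this division of work follows, with each of N "PEs" modelled as a thread applying the same instruction sequence to its own chunk of x; the data values and the thread-based model are illustrative assumptions.

```cpp
// A minimal sketch of the SIMD idea: the x vector is divided into N
// chunks, and each of N "PEs" (threads here) applies the same
// computation, Z = sin(x) + cos(x) + tan(x), to its own chunk.
#include <cmath>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const std::size_t N = 2;                        // number of PEs
    std::vector<double> x = {0.1, 0.2, 0.3, 0.4}, z(x.size());

    std::vector<std::thread> pes;
    for (std::size_t p = 0; p < N; ++p)
        pes.emplace_back([&, p] {
            // Each PE handles one contiguous, non-overlapping chunk.
            for (std::size_t i = p * x.size() / N;
                 i < (p + 1) * x.size() / N; ++i)
                z[i] = std::sin(x[i]) + std::cos(x[i]) + std::tan(x[i]);
        });
    for (auto& pe : pes) pe.join();

    for (double v : z) std::cout << v << " ";
    std::cout << "\n";
}
```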
3. Multiple-instruction, single-data (MISD) systems –
The system performs different operations on the same data set. Machines built using the MISD model are not useful in most applications; a few machines have been built, but none of them are available commercially.
4. Multiple-instruction, multiple-data (MIMD) systems –
An MIMD system is a multiprocessor machine capable of executing multiple instructions on multiple data sets. Each PE in the MIMD model has separate instruction and data streams; therefore, machines built using this model are suited to any kind of application. Unlike SIMD and MISD machines, PEs in MIMD machines work asynchronously.
A failure in one PE can bring down the whole system in the shared memory model; this does not happen in the case of distributed memory, in which each PE has its own memory. As a result of practical outcomes and users' requirements, distributed memory MIMD architecture is superior to the other existing models.
• For example:
• CPU A reads location x, getting the value N.
• Later, CPU B reads the same location, getting the value N.
• Next, CPU A writes location x with the value N - 1.
• At this point, any reads from CPU B will get the value N, while reads from CPU A will get the value N - 1.
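This sequence can be modelled with a minimal C++ sketch in which each CPU holds a private cached copy of location x; the initial value N = 5 is an assumed example.

```cpp
// A minimal sketch of the incoherence scenario: each CPU has a private
// cached copy of location x, so after CPU A writes N - 1, CPU B still
// reads the stale value N.
#include <iostream>

struct Cpu { int cached_x; };

int main() {
    int memory_x = 5;                 // location x holds N = 5
    Cpu a{}, b{};

    a.cached_x = memory_x;            // CPU A reads x, getting N
    b.cached_x = memory_x;            // CPU B reads x, getting N
    a.cached_x = memory_x - 1;        // CPU A writes N - 1 (only in its cache)

    std::cout << "A sees " << a.cached_x    // 4 (N - 1)
              << ", B sees " << b.cached_x  // 5 (stale N)
              << "\n";
}
```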
• Upon closer inspection, there are several aspects that need to be addressed.
Cache Coherence
• Coherence defines what values can be returned by a read.
3. Memory consistency :
Memory consistency defines the order in which memory
operations (from any process) appear to execute with respect to
one another.
What orders are kept?
Given a load, what are the possible values it can return?
It is impossible to tell much about the execution of an SAS (shared address space) program without it.
It has consequences for both programmers and system designers:
A programmer uses the model to reason about correctness and possible outcomes.
System designers can use it to determine which accesses can be reordered by the compiler or hardware.
It is an agreement between the programmer and the system.
Interconnection Network:-
Distributed Shared Memory (DSM):-
DSM permits programs running on separate machines to share data without the programmer having to deal with sending messages; instead, the underlying technology sends the messages needed to keep the DSM consistent between computers. DSM also permits programs that used to run on the same computer to be easily adapted to run on separate machines. Programs access what appears to them to be normal memory.
Hence, programs that use DSM are usually shorter and easier to understand than programs that use message passing. However, DSM is not appropriate for all situations. Client-server systems are generally less suited to DSM, though a server may be used to assist in providing DSM functionality for data shared between clients.
Architecture of Distributed Shared Memory (DSM) :
Every node consists of one or more CPUs and a memory unit. A high-speed communication network is employed for connecting the nodes. A simple message passing system permits processes on different nodes to exchange messages with one another.
Memory mapping manager unit :
A memory mapping manager routine in every node maps the local memory onto the shared virtual memory. For the mapping operation, the shared memory space is divided into blocks.
Data caching is a well-known solution for dealing with memory access latency, and DSM uses data caching to reduce network latency: the main memory of the individual nodes is used to cache pieces of the shared memory space.
The memory mapping manager of each node views its local memory as a big cache of the shared memory space for its associated processors. The basic unit of caching is a memory block. In systems that support DSM, data moves between secondary memory and main memory as well as between the main memories of different nodes.
Communication Network Unit :
When a process accesses data in the shared address space, the mapping manager maps the shared memory address to physical memory. The mapping layer of code is implemented either in the operating system kernel or as a runtime routine.
Physical memory on each node holds pages of the shared virtual address space. Local pages are present in that node's memory; remote pages reside in some other node's memory.
Cluster computing:-
It is a collection of tightly or loosely connected computers that work together so that they act as a single entity. The connected computers execute operations all together, thus creating the idea of a single system. Clusters are generally connected through fast local area networks (LANs).
Cluster Computing
1. Open Cluster :
IP addresses are needed by every node, and these are accessed only through the internet or web. This type of cluster raises enhanced security concerns.
2. Close Cluster :
The nodes are hidden behind the gateway node, and they provide
increased protection. They need fewer IP addresses and are good for
computational tasks.
Cluster Computing Architecture :
• It is designed with an array of interconnected individual
computers and the computer systems operating collectively
as a single standalone system.
• It is a group of workstations or computers working together
as a single, integrated computing resource connected via high
speed interconnects.
• A node – Either a single or a multiprocessor network having
memory, input and output functions and an operating system.
• Two or more nodes are connected on a single line or every
node might be connected individually through a LAN
connection.
4. Network switching hardware
Cluster Components
Disadvantages of Cluster Computing :
1. High cost :
It is not very cost-effective, due to its expensive hardware and design.
2. Problem in finding fault :
It is difficult to find which component has a fault.
3. More space is needed :
Infrastructure requirements grow, as more servers are needed for management and monitoring.
Applications of Cluster Computing :
• Various complex computational problems can be solved.
• It can be used in the applications of aerodynamics,
astrophysics and in data mining.
• Weather forecasting.
• Image Rendering.
• Various e-commerce applications.
• Earthquake Simulation.
• Petroleum reservoir simulation.