COMPUTER SYSTEM ARCHITECTURE
SUFYAN P
Assistant Professor
sufyan@meaec.edu.in
Computer Science and Engineering
MEA Engineering College, Perinthalmanna
Text book: Jotwani, Advanced Computer Architecture: Parallelism, Scalability, Programmability, TMH, 2010.
❖ Computer hardware:
• Consists of electronic circuits, displays, magnetic and optical
storage media, electromechanical equipment and
communication facilities.
❖ Computer Architecture:
• It is concerned with the structure and behavior of the
computer.
• It includes the information formats, the instruction set and
techniques for addressing memory.
• Multiprocessors and multicomputers
• Cache coherence protocols
• Multicomputers
• Pipelined computers and multithreading
• Computer architecture started with the Von Neumann architecture
o Built as a sequential machine
o Executing scalar data
• Major advancements came due to the following techniques:
o Look-ahead technique
o Parallelism & pipelining
o Flynn’s classification
o Parallel / vector computers
o I/E➔ instruction fetch and execute
• Flynn's classifications are:
o SISD (single instruction stream over a single data stream)
• E.g.: conventional sequential machines
o SIMD (single instruction stream over multiple data streams)
• E.g.: vector computers
o MIMD (multiple instruction streams over multiple data streams)
• E.g.: parallel computers
o MISD (multiple instruction streams over a single data stream)
• E.g.: special-purpose computers
o Shared-memory multiprocessors
o Message-passing multicomputers
• The distinction between the two lies in:
o Memory sharing
o Interprocessor communication
• Interprocessor communication is done via message passing
• Memory-to-memory architecture
o Supports pipelined flow of vector operands directly from memory to the pipelines and then back to memory
• Clock rate
o Inverse of the cycle time (f = 1/τ)
• Instruction count (Ic)
o Number of machine instructions to be executed in a program
• CPI (cycles per instruction)
o Number of cycles taken to execute one instruction
• T = Ic * CPI * τ ………………..(1)
• In general CPI = p + m*k, where p is the number of processor cycles, m the number of memory references per instruction, and k the ratio of memory-cycle time to processor-cycle time
o k ➔ depends on
• speed of the cache
• memory technology
• processor-memory interconnection scheme
• Since CPI = C / Ic, where C is the total number of clock cycles needed to execute the program, eq. (1) can be rewritten as follows:
• T = Ic * CPI * τ ➔ T = Ic * (C / Ic) * τ
➔ T = C * τ ……….(4)
➔ T = C / f ………….(5)
• MIPS rate = Ic / (T * 10^6) ……….(6)
• In terms of eq. (1), the above equation can be rewritten as follows:
➔ MIPS rate = Ic / (Ic * CPI * τ * 10^6) = f / (CPI * 10^6) ……….(7) = f / ((C / Ic) * 10^6)
➔ MIPS rate = (Ic * f) / (C * 10^6) ……….(8)
• Wp = f / (Ic * CPI) ……..(9)
o Wp is the throughput: the number of programs executed per second
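As a quick sanity check, here is a minimal Python sketch of the relations above; the variable names (Ic, CPI, tau, f) follow the text, and the helper functions themselves are illustrative, not from the textbook.

```python
# Minimal sketch of the performance relations above; names follow the text.
def exec_time(Ic, CPI, tau):
    """Eq. (1): T = Ic * CPI * tau (tau = cycle time in seconds)."""
    return Ic * CPI * tau

def mips_rate(f, CPI):
    """Eq. (7): MIPS = f / (CPI * 10^6) (f = clock rate in Hz)."""
    return f / (CPI * 1e6)

def throughput(f, Ic, CPI):
    """Eq. (9): Wp = f / (Ic * CPI), programs executed per second."""
    return f / (Ic * CPI)
```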
• Example: a program of Ic = 100,000 instructions executes in C = 155,000 cycles on a 40-MHz processor
• Effective CPI = C / Ic
➔ 155000 / 100000
➔ CPI = 1.55
• MIPS rate = f / (CPI * 10^6) = (40 * 10^6) / (1.55 * 10^6) = 25.8
• Execution time T = Ic * CPI * τ
• = 100000 * 1.55 * 0.025 μs
• = 3875 μs
• = 3.875 ms
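Plugging the example's numbers in reproduces each result; note the 40-MHz clock is inferred from the MIPS rate and CPI given here.

```python
Ic  = 100_000                # instruction count
C   = 155_000                # total clock cycles
f   = 40e6                   # clock rate (Hz), inferred from 25.8 MIPS at CPI 1.55
tau = 1 / f                  # cycle time = 0.025 microseconds

CPI  = C / Ic                # -> 1.55
MIPS = f / (CPI * 1e6)       # -> ~25.8
T    = Ic * CPI * tau        # -> 0.003875 s
print(CPI, MIPS, T * 1e3)    # -> 1.55, 25.8..., 3.875 ms
```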
• FLOPS ➔ number of floating-point operations per second
o This is done by the compiler
o The compiler must be able to detect parallelism
• Their difference is based on memory:
o One has shared common memory
o The other has unshared distributed memory
o COMA model (cache-only memory architecture)
• Due to this high degree of resource sharing, multiprocessors are also called tightly coupled systems
• Communication & synchronization between processors are done via shared variables (see the sketch after this list)
• System interconnection is done using:
o Bus
o Crossbar switch
o Multistage network
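A minimal sketch of the shared-variable model, with Python threads standing in for processors that share one memory; the lock is an assumed synchronization primitive for illustration, not any particular machine's mechanism.

```python
import threading

counter = 0                          # shared variable in the common memory
lock = threading.Lock()              # synchronizes the "processors"

def processor(iterations):
    global counter
    for _ in range(iterations):
        with lock:                   # communication & synchronization happen
            counter += 1             # through a variable in shared memory

threads = [threading.Thread(target=processor, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                       # -> 40000
```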
• Speed up the execution of a single large program in time-critical applications
• Run system programs such as the OS kernel and I/O routines
• NUMA models:
o Shared local memory model
o Hierarchical cluster model
o The collection of all local memories forms a global address space
o This is accessible by all processors
• Access variations:
o Access to the local memory attached to a local processor is faster
o Access to remote memory attached to other processors takes longer
• This is due to the added delay in the interconnection network
o All clusters have equal access to the global memory
• All processors belonging to the same cluster uniformly access the cluster shared-memory modules (CSM)
• Access time to the CSM is shorter than to the global shared memory (GSM)
• Access rights to inter-cluster memories can be specified in various ways
• No memory hierarchy at the processor nodes
• All caches together form a global address space
• Remote cache access is assisted by distributed cache directories
• Local memories are private, accessible only by the local processors
• Therefore they are also called no-remote-memory-access machines (NORMA)
• Inter-node communication is carried out by message passing, as sketched below
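By contrast, here is a minimal message-passing sketch, with Python processes standing in for nodes that share no memory; a real multicomputer would use an interconnection network and, typically, a library such as MPI.

```python
from multiprocessing import Process, Pipe

def node(conn):
    data = conn.recv()             # message arrives from the other node
    conn.send(sum(data))           # reply by message; no memory is shared
    conn.close()

if __name__ == "__main__":
    local_end, remote_end = Pipe()
    p = Process(target=node, args=(remote_end,))
    p.start()
    local_end.send([1, 2, 3, 4])   # inter-node communication by message passing
    print(local_end.recv())        # -> 10
    p.join()
```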
• Supercomputers are classified as:
o Vector supercomputers
• Use powerful processors equipped with vector hardware
o SIMD supercomputers
• Provide massive data parallelism
• Scalar control unit decodes all the instructions
• Vector registers are used to hold the vector operands, intermediate and final vector results
o The vector functional pipelines receive operands from these registers & put the results back into these registers
o All vector registers are programmable
o Each vector register has a component counter
• It keeps track of the component registers used in successive pipeline cycles
• A vector stream unit is used instead of vector registers
• Vector operands & results are directly retrieved from and stored into the main memory
• E.g.: Cyber 205
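As an analogy only (NumPy is a software library, not vector hardware), a single NumPy expression plays the role of one vector instruction consuming whole vector operands:

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
b = np.ones_like(a)

c = a + b   # one "vector instruction" over all components at once
# scalar equivalent, one component at a time:
# for i in range(len(a)): c[i] = a[i] + b[i]
```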
• I ➔ set of instructions broadcast to all PEs for parallel execution, executed by each PE over the data within that PE
• M ➔ set of masking schemes
o Each mask partitions the set of PEs into enabled & disabled subsets
• R ➔ set of data-routing functions
o Used in the interconnection network for inter-PE communications
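To make the masking idea concrete, here is a small sketch in which a NumPy boolean mask stands in for a masking scheme: the broadcast operation takes effect only in the enabled "PEs". This is an analogy, not a model of any SIMD machine.

```python
import numpy as np

data = np.arange(8)            # one element per "PE"
mask = data % 2 == 0           # masking scheme: enabled vs. disabled PEs
data[mask] += 100              # broadcast instruction acts only on enabled PEs
print(data)                    # -> [100   1 102   3 104   5 106   7]
```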
• Amdahl's law gives the theoretical speedup in latency of the execution of a task at a fixed workload that can be expected of a system whose resources are improved.
• In other words, it is a formula used to find the maximum improvement possible by improving just a particular part of a system.
• It is often used in parallel computing to predict the theoretical speedup when using multiple processors.
• Speedup is the ratio of the performance for the entire task using the enhancement to the performance for the entire task without using the enhancement.
• OR
• Speedup can be defined as the ratio of the execution time for the entire task without using the enhancement to the execution time for the entire task using the enhancement.
• If Pe is the performance for the entire task using the enhancement, Pw is the performance for the entire task without it,
• Ew is the execution time for the entire task without using the enhancement, and
• Ee is the execution time for the entire task using the enhancement when possible, then
• Speedup = Pe/Pw
or
Speedup = Ew/Ee
• Fraction enhanced: the fraction of the original execution time that can take advantage of the enhancement (always ≤ 1)
• Speedup enhanced: how much faster the task would run if the enhanced mode were used for the whole program (always > 1)
• For example, if 10 seconds of the execution time of a program that takes 40 seconds in total can use an enhancement, the fraction is 10/40 = 0.25. This value is the fraction enhanced.
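A minimal sketch of the resulting formula, overall speedup = 1 / ((1 - Fe) + Fe / Se), applied to the example's fraction; the enhancement speedups tried below are assumed values for illustration.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # overall speedup = 1 / ((1 - Fe) + Fe / Se)
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

Fe = 10 / 40                       # fraction enhanced from the example above
print(amdahl_speedup(Fe, 2))       # assumed 2x enhancement -> ~1.14
print(amdahl_speedup(Fe, 1e9))     # near-infinite enhancement -> ~1.33 ceiling
```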
• Data dependence
• Control dependence
• Resource dependence
o Flow dependence
o Anti-dependence
o Output dependence
o I/O dependence
o Unknown dependence
• S1: LOAD R1, A
• S2: ADD R2, R1
• S2 is flow-dependent on S1
o Because the output of S1 is fed as input to S2
o i.e., variable A is loaded into register R1, which S2 then reads
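For contrast, here is a small Python sketch (hypothetical statements, not from the text) showing how the other register-reuse dependences arise:

```python
a, b = 1, 2

r1 = a          # S1: writes r1
r2 = 0 + r1     # S2: reads r1 -> S2 is flow-dependent on S1 (read-after-write)
r1 = b          # S3: rewrites r1 after S2 has read it
                #     -> S3 is anti-dependent on S2 (write-after-read)
                #     -> S3 is output-dependent on S1 (write-after-write)
```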
• I/O dependence example:
• S1: READ(4), A(I)
• S2: PROCESS
• S3: WRITE(4), B(I)
• S4: CLOSE(4)
• S3 is I/O-dependent on S1, since both access the same file (unit 4)
• Conditional branching may eliminate or introduce data dependencies
• Control dependence prohibits parallelism
• Solutions:
o Compiler techniques
o Hardware branch-prediction techniques
• Resource dependence arises from conflicts in using shared resources, such as:
o Registers
o Memory areas
• In explicit parallelism, programmers are required to exploit the parallelism themselves
• The degree of software parallelism depends on:
o Algorithm
o Programming style
o Program design
o Pipelining
o Multiple functional units
• Limitations
o Length of the pipeline
o Multiplicity of functional units
• Features included in existing compilers to improve parallelism are:
o Loop transformation
o Software pipelining
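As an illustration of the first feature, here is one loop transformation (unrolling) done by hand in Python; a parallelizing compiler would perform the equivalent automatically on the intermediate code.

```python
def saxpy(a, x, y):
    # original loop: iterations are independent but expressed one at a time
    for i in range(len(x)):
        y[i] += a * x[i]

def saxpy_unrolled(a, x, y):
    # unrolled by 4: exposes independent operations that pipelined
    # hardware or multiple functional units can overlap
    n = len(x) - len(x) % 4
    for i in range(0, n, 4):
        y[i]     += a * x[i]
        y[i + 1] += a * x[i + 1]
        y[i + 2] += a * x[i + 2]
        y[i + 3] += a * x[i + 3]
    for i in range(n, len(x)):   # remainder iterations
        y[i] += a * x[i]
```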