You are on page 1of 29

CSE539: Advanced ComputerArchitecture

Chapter 8
Multivector and SIMD Computers
Book: “Advanced ComputerArchitecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani

Sumit Mittu
Assistant Professor, CSE/IT
Lovely Professional University
sumit.12735@lpu.co.in
In this chapter…

• Vector Processing Principles


• Compound Vector Operations
• Vector Loops and Chaining
• SIMD Computer Implementation Models

Sumit Mittu, Assistant Professor, CSE/IT, Lovely Professional University 2


Vector Processing Principles
• A vector is a set of scalar data items, all of the same type, stored in
memory. Usually, the vector elements are ordered to have a fixed
addressing increment between successive elements called the stride.
• A vector processor is an ensemble of hardware resources, including
vector registers, functional pipelines, processing elements, and
register counters, for performing vector operations.
• Vector processing occurs when arithmetic or logical operations are
applied to vectors. The conversion from scalar processing to vector
code is called vectorization.
• Vector processing speedup 10..20 compared with scalar processing.
A compiler capable of vectorization is called vectorizing compiler or
vectorizer.
VECTORPROCESSING PRINCIPLES

• Vector Processing Definitions


o Vector
o Stride
o Vector Processor
o Vector Processing
o Vectorization
o Vectorizing Compiler or Vectorizer

• Vector Instruction Types


o Vector-vector instructions
o Vector-scalar instructions
o Vector-memoryinstructions

Sumit Mittu, Assistant Professor, CSE/IT, Lovely Professional University 4


VECTORPROCESSING PRINCIPLES

Sumit Mittu, Assistant Professor, CSE/IT, Lovely Professional University 5


Vector instructions
1. Vector-vector instructions
One or two vector operands are fetched form the respective vector
registers, enter through a functional pipeline unit, and produce result in
another vector register.
2. Vector-scalar instructions
3. vector-memory instructions Store-load of vector registers
4. Vector reduction instructions maximum, minimum, sum, mean
value.
5. Gather and scatter instructions Two instruction registers are used to
gather or scatter vector elements randomly through the memory
(operations with sparse vectors).
6. Masking instructions The Mask vector is used to compress or to
expand a vector to a shorter or longer index vector (bit per index
correspondence).
VECTORPROCESSING PRINCIPLES

• Vector-Vector Instructions
o F1: Vi  Vj
o F2: Vi x Vj Vk
o Examples: V1 = sin(V2) V3 = V1+ V2

• Vector-Scalar Instructions
o F3: s x Vi  Vj
o Examples: V2 = 6 +
V1
• Vector-Memory Instructions
o F4: MV (Vector Load)
o F5: VM (Vector Store)
o Examples: X = V1 V2 = Y

Sumit Mittu, Assistant Professor, CSE/IT, Lovely Professional University 7


VECTORPROCESSING PRINCIPLES

• Vector Reduction Instructions


o F6: Vi  s
o F7: Vi x Vj  s

• Gather and Scatter Instructions


o F8: M  Vi x Vj (Gather)
o F9: Vi x Vj  M (Scatter)

• Masking
o F10: Vi x Vm  Vj (Vm is a binary
vector)
• Examples…

Sumit Mittu, Assistant Professor, CSE/IT, Lovely Professional University 8


VECTORPROCESSING PRINCIPLES

Sumit Mittu, Assistant Professor, CSE/IT, Lovely Professional University 9


VECTORPROCESSING PRINCIPLES

• Vector-Access Memory Schemes


o Vector-operandSpecifications
• Baseaddress, stride and length
o C-Access Memory Organization
• Low-order m-way interleavedmemory
o S-accessMemory Organizations
• High-order m-way interleavedmemory
o C/SAccess Memory Organization

• Early Supercomputers (Vectors Processors)


o Cray Series ETA 10E NEC Sx-X
44
o CDC Cyber Fujitsu VP2600 Hitachi
820/80
Sumit Mittu, Assistant Professor, CSE/IT, Lovely Professional University 10
Vector-access memory
schemes
Vector operands may have arbitrary length.
Vector elements are not necessarily stored in contiguous memory
locations.
To access a vector a memory, one must specify its base, stride, and
length.
Since each vector register has fixed length, only a segment of the
vector can be loaded into a vector register.
Vector operands should be stored in memory to allow pipelined and
parallel access. Access itself should be pipelined.
C-Access memory organization The m-way low-order memory
structure, allows m words to be accessed concurrently and
overlapped.
S-Access memory organization All modules are accessed
simultaneously storing consecutive words to data buffers. The low
order address bits are used to multiplex the m words out of buffers.
C/S-Access memory organization.
C-access
Eight-way interleaved memory (m =
8 and w = 8). m is called the degree
of interleaving.
The major cycle is the total time
required to complete the access of
a single word form a memory. The
minor cycle is the actual time
needed to produce one word,
assuming overlapped access of
successive memory modules
separated in every memory cycle .

EENG-630
Relative Vector/Scalar
Performance
Let r be the vector/scalar speed ratio and f the
vectorization ratio.
By Ahmdahl's law the following relative performance
can be defined:
1 r
P 
(1  f )  f r (1  f )r  f
The limiting case is P -> 1 if f -> 0.
Example: IBM - r = 3:::4, Cray - r = 10:::25.

EENG-630
Multivector
Multiprocessors
Architecture Design Goals
Maintaining a good vector/scalar performance balance. The
vector balance point is defined as the percentage of vector
code in a program required to achieve equal utilization of
vector and scalar hardware (usually 90...97%).
Supporting scalability with an increasing number of
processors (The dominant problem involves support of
shared memory with an increasing number of processor and
memory ports).
Increasing memory system capacity (up to Tbytes today,
hierarchy is necessary) and performance
Providing high-performance I/O (>50 Gbytes/s) and easy-
access network.
Compound Vector Processing
A compound vector function (CVF) is defined as a composite
function of vector operations converted from a looping structure of
linked scalar operations.
Do 10 I=1,N
Load R1, X(I)
Load R2, Y(I)
Multiply R1, S
Add R2, R1
Store Y(I), R2
10 Continue
CVF: Y(1:N) = S X(1:N) + Y(1:N)
or
Y(I) = S X(I) + Y(I)
CVF Chaining and Strip-
mining
Typical CVF for one-dimensional arrays are load, store, multiply, divide,
logical and shifting operations.
*** The number of available vector registers and functional
pipelines impose some restrictions on how many CVFs can be executed
simultaneously.
Chaining
Chaining is an extension of technique of internal data
forwarding practiced in scalar processors. Chaining is limited by
the small number of functional pipelines available in a vector processor.
Strip-mining
When a vector has a length greater than that of the
vector registers, segmentation of the long vector into fixed-length
segments is necessary. One vector segment is processed at a time
(in Cray computers segment is 64 elements).
Recurrence
The special case of vector loops in which the output of
a functional pipeline may feed back into one of its own source
vector registers.
SIMD Computer
Organizations
SIMD models dierentiates on base of memory distribution
and addressing scheme used. Most SIMD computers use
a single control unit and distributed memories, except for
a few that use associative memories.
Distributed memory model
Spatial parallelism among PEs. A
distributed memory SIMD consists of an array of
PEs (supplied with local memory) which are controlled
by the array control unit. Program and data are
loaded into the control memory through the host
computer and distributed from there to PEs local
memories.
SIMD using distributed
local memory
SIMD using shared-memory
VECTORPROCESSING PRINCIPLES

• Relative Vector/Scalar Performance


o Vector/scalar speed ratio r
o Vectorizationratio in program f
o Relative PerformancePis given by:
𝟏 𝒓
• 𝑷= =
𝟏−𝒇 + 𝒇/𝒓 𝟏−𝒇 𝒓 + 𝒇
o Whenf is low, the speedup cannotbe high even with very high r
o LimitingCase:
• P 1 if f  0
o MaximumCase:
• P r if f  1
o Powerful singlechip processors and multicore system-on-a-chipprovide High-Performance
Computing (HPC) using MIMD and/or SPMDconfigurations with large no. of processors.

Sumit Mittu, Assistant Professor, CSE/IT, Lovely Professional University 20


COMPUOUND VECTORPROCESSING

• Compound Vector Operations


o CompoundVector Functions (CVFs)
• Composite functionof vector operations converted from a looping structure of linkedscalar
operations
o CVFExample:The SAXPY(Single-precision AmultiplyX Plus Y) Code
• For I = 1 to N
o Load R1, X(I)
o Load R2, Y(I)
o Multiply R1,A
o Add R2, R1
o Store Y(I), R2
• (End of Loop)

Sumit Mittu, Assistant Professor, CSE/IT, Lovely Professional University 21


COMPUOUND VECTORPROCESSING

• One-dimensional CVF Examples


o V(I) = V2(I) + V(3) x V(4)
o V1(I) = B(I) + C(I)
o A(I) = V(I) x S+ B(I)
o A(I) = V(I) + B(I) + C(I)
oA(I) = Qx v1(I) (R x B(I) + C(I)), etc.
Legend:
o Vi(I) are vector registers
o A(I), B(I), C(I) are vectors in memory
o Q, Sare scalars availablefrom scalar registersin memory

Sumit Mittu, Assistant Professor, CSE/IT, Lovely Professional University 22


COMPUOUND VECTORPROCESSING

• Vector Loops
o Vector segmentationor strip-miningapproach
o Example

• Vector Chaining
o Example: SAXPYcode
• Limited Chainingusing only one memory-access pipe in Cray-I
• Complete Chainingusing three memory-access pipes in Cray X-MP

• Functional Unit Independence


• Vector Recurrence

Sumit Mittu, Assistant Professor, CSE/IT, Lovely Professional University 23


COMPUOUND VECTORPROCESSING

Sumit Mittu, Assistant Professor, CSE/IT, Lovely Professional University 24


COMPUOUND VECTORPROCESSING

Sumit Mittu, Assistant Professor, CSE/IT, Lovely Professional University 25


SIMD COMPUTER ORGANIZATIONS

• SIMD Computer Variants


o Array Processor
o AssociativeProcessor
• SIMDProcessor v/s SISD v/s Vector Processor Operation
o Illustration: for(i=0;i<5;i++)a[i] = a[i]+2;
o Lockstepmodeof operationin SIMD processor
o Relative Performancecomparison
• SIMDImplementation Models
o DistributedMemory Model
• E.g. Illiac IV
o Sharedmemory Model
• E.g. BSP(Burroughs ScientificProcessor)

Sumit Mittu, Assistant Professor, CSE/IT, Lovely Professional University 26


SIMD COMPUTER ORGANIZATIONS

Sumit Mittu, Assistant Professor, CSE/IT, Lovely Professional University 27


SIMD COMPUTER ORGANIZATIONS

Sumit Mittu, Assistant Professor, CSE/IT, Lovely Professional University 28


SIMD COMPUTER ORGANIZATIONS

• SIMD Instructions
o Scalar Operations
• Arithmetic/Logical
o Vector Operations
• Arithmetic/Logical
o Data RoutingOperations
• Permutations, broadcasts, multicasts, rotationand shifting
o MaskingOperations
• Enable/DisablePEs
• Host and I/O
• Bit-slice and Word-slice Processing
o WSBS,WSBP, WPBS,WPBP

Sumit Mittu, Assistant Professor, CSE/IT, Lovely Professional University 29

You might also like