SIMD

Single instruction, multiple data (SIMD)
Contents
 Parallel Processors
 Flynn's taxonomy
 What is SIMD?
 Types of Processing
 Scalar Processing
 Vector Processing
 Architecture for Vector Processing
 Vector processors
 Vector Processor Architectures
 Components of Vector Processors
 Advantages of Vector Processing
 Array processors
 Array Processor Classification
 Array Processor Architecture
 Dedicated Memory Organization
 Global Memory Organization
 ILLIAC IV
 ILLIAC IV Architecture
 Super Computers
 Cray X1
 Multimedia Extension
Parallel Processors
 In computing, parallel processing is the execution of a
program's instructions by dividing them among multiple
processors, with the objective of running the program in
less time.
 In the earliest computers, only one program ran at a time. A
computation-intensive program that took one hour to run and a
tape-copying program that took one hour to run would take a
total of two hours to run. An early form of parallel processing
allowed the interleaved execution of both programs together.
 The computer would start an I/O operation and, while it was
waiting for the operation to complete, it would execute the
processor-intensive program. The total execution time for the
two jobs would be a little over one hour.
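The timing argument above can be sketched with a very rough model. The function names and the `switch_overhead_hours` value are hypothetical and chosen only for illustration; the model assumes the I/O-bound job leaves the CPU idle almost the whole time.

```python
def sequential_time(cpu_job_hours, io_job_hours):
    """Run one job after the other: the times simply add up."""
    return cpu_job_hours + io_job_hours


def interleaved_time(cpu_job_hours, io_job_hours, switch_overhead_hours=0.05):
    """Overlap the jobs: the CPU-bound job runs while the I/O job waits.

    Assumption: the total is bounded by the longer job plus a small,
    made-up overhead for starting and servicing the I/O operations.
    """
    return max(cpu_job_hours, io_job_hours) + switch_overhead_hours


total_sequential = sequential_time(1, 1)    # two hours
total_interleaved = interleaved_time(1, 1)  # a little over one hour
```

With both jobs taking one hour, the sequential total is two hours, while the interleaved total stays just above one hour, as in the slide.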
Flynn's taxonomy
 Flynn's taxonomy is a classification of
computer architectures, proposed by
Michael J. Flynn in 1966. The
classification system has stuck and
has been used as a tool in the design of
modern processors and their
functionalities.
Classification
 The four classifications defined by Flynn are based
upon the number of concurrent instruction (or control)
streams and data streams available in the architecture.
 Single instruction stream, single data stream (SISD)
 Single instruction stream, multiple data streams
(SIMD)
 Multiple instruction streams, single data stream
(MISD)
 Multiple instruction streams, multiple data streams
(MIMD)
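As a small illustration, Flynn's four classes (SISD, SIMD, MISD, MIMD) can be derived from just two counts. The function name `flynn_class` is hypothetical; this is a sketch of the naming rule, not a tool for classifying real hardware.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn class for the given number of concurrent streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    # "SI"/"MI": single or multiple instruction streams.
    instruction_part = "SI" if instruction_streams == 1 else "MI"
    # "SD"/"MD": single or multiple data streams.
    data_part = "SD" if data_streams == 1 else "MD"
    return instruction_part + data_part


# One instruction stream driving eight data streams is a SIMD machine.
example = flynn_class(1, 8)
```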
Evolution of Intel Vector Instructions
■ MMX (1996, Pentium)
 CPU-based MPEG decoding
 Integers only, 64-bit divided into 2 x 32 to 8 x 8
 Phased out with SSE4
■ SSE (1999, Pentium III)
 CPU-based 3D graphics
 4-way float operations, single precision
 8 new 128-bit registers, 100+ instructions
■ SSE2 (2001, Pentium 4)
 High-performance computing
 Adds 2-way float ops, double-precision; same registers as 4-way single-precision
 Integer SSE instructions make MMX obsolete
■ SSE3 (2004, Pentium 4E Prescott)
 Scientific computing
 New 2-way and 4-way vector instructions for complex
arithmetic
■ SSSE3 (2006, Core 2 Duo)
 Minor advancement over SSE3
■ SSE4 (2007, Core 2 Duo Penryn)
 Modern codecs, cryptography
 New integer instructions
 Better support for unaligned data, super shuffle engine
What is SIMD?
 Single instruction, multiple data (SIMD) is a class
of parallel computers in Flynn's taxonomy.
 It describes computers with multiple processing
elements that perform the same operation on multiple
data points simultaneously. Thus, such machines
exploit data-level parallelism.
 There are simultaneous (parallel) computations, but
only a single process (instruction) at a given moment.
How does SIMD process data?
[Figure: SIMD processing/working]
Types of Processing
 Scalar Processing
 A scalar processor is a CPU that performs computations on
one number or one set of data at a time. A scalar processor
is known as a "single instruction stream, single data
stream" (SISD) CPU.
 Vector Processing
 A vector processor or array processor is a
central processing unit (CPU) that implements an
instruction set containing instructions that operate on
1-D arrays of data called vectors.
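The difference between the two styles can be sketched by counting modeled instructions. This is plain Python, so nothing here runs in parallel; `vector_length=4` and both function names are assumptions made for the illustration.

```python
def scalar_add(a, b):
    """Scalar style: one modeled add instruction per pair of elements."""
    result = []
    instructions = 0
    for x, y in zip(a, b):
        result.append(x + y)
        instructions += 1
    return result, instructions


def vector_add(a, b, vector_length=4):
    """Vector style: one modeled add instruction per chunk of elements."""
    result = []
    instructions = 0
    for start in range(0, len(a), vector_length):
        end = start + vector_length
        # A real vector unit would add all lanes of the chunk at once.
        result.extend(x + y for x, y in zip(a[start:end], b[start:end]))
        instructions += 1
    return result, instructions


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10] * 8
scalar_result, scalar_count = scalar_add(a, b)
vector_result, vector_count = vector_add(a, b)
```

Both calls produce the same sums; the scalar version issues eight modeled instructions and the vector version two.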
Architecture for Vector Processing
 Two architectures suitable for vector processing are:
 Pipelined vector processors
 Parallel array processors
Pipelined vector processors
 CPU that implements an instruction set that
operates on 1-D arrays, called vectors
 Vectors contain multiple data elements
 Number of data elements per vector is typically
referred to as the vector length
 Both instructions and data are pipelined to reduce
decoding time
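Why pipelining helps with long vectors can be seen from an idealized textbook cycle count, which is not taken from the slides: a k-stage pipeline needs k cycles for the first element and one more cycle for each further element. The function names and the 4-stage, 64-element figures are hypothetical.

```python
def unpipelined_cycles(vector_length, stages):
    """Each element passes through all stages before the next one starts."""
    return vector_length * stages


def pipelined_cycles(vector_length, stages):
    """Idealized pipeline: fill it once, then finish one element per cycle."""
    if vector_length == 0:
        return 0
    return stages + (vector_length - 1)


# A 64-element vector through a hypothetical 4-stage functional unit.
slow = unpipelined_cycles(64, 4)
fast = pipelined_cycles(64, 4)
```

Under these assumptions the pipelined unit needs 67 cycles instead of 256, and the advantage grows with the vector length.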
Advantages of Vector Processing
 Quick fetch and decode of a single instruction for multiple
operations.
 The instruction provides a regular source of data, which
arrive at each cycle and can be processed efficiently in a
pipelined fashion.
 Easier addressing of main memory
 Elimination of memory wastage
 Simplification of control hazards
 Reduced code size
Array Processors
 An array processor is a processor that performs
computations on a large array of data.
 An array processor is a synchronous parallel
computer with multiple ALUs, called processing
elements (PEs), that can operate in parallel in
lockstep fashion.
 It is composed of N identical PEs under the control
of a single control unit, and a number of memory
modules.
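The organization described above can be sketched as a toy model: one control unit broadcasts each instruction, and every PE applies it to its own local data. The class names, the `ADD`/`MUL` opcodes, and the single accumulator per PE are all hypothetical simplifications; the loop stands in for what real hardware does simultaneously.

```python
class ProcessingElement:
    """One PE: an ALU with a single local accumulator."""

    def __init__(self, local_value):
        self.acc = local_value

    def execute(self, opcode, operand):
        if opcode == "ADD":
            self.acc += operand
        elif opcode == "MUL":
            self.acc *= operand
        else:
            raise ValueError(f"unknown opcode: {opcode}")


class ControlUnit:
    """Single control unit driving N identical PEs in lockstep."""

    def __init__(self, values):
        self.pes = [ProcessingElement(v) for v in values]

    def broadcast(self, opcode, operand):
        # Every PE receives the same instruction in the same step.
        for pe in self.pes:
            pe.execute(opcode, operand)

    def read(self):
        return [pe.acc for pe in self.pes]


cu = ControlUnit([1, 2, 3, 4])
cu.broadcast("ADD", 10)
cu.broadcast("MUL", 2)
state = cu.read()
```

Two broadcast instructions update all four PEs, leaving the values 22, 24, 26 and 28.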
Array Processor Classification
SIMD (Single Instruction, Multiple Data) array processor:
an array processor that has a single-instruction,
multiple-data organization.
It manipulates vector instructions by means of multiple
functional units responding to a common instruction.
Attached array processor:
an auxiliary processor attached to a general-purpose
computer.
Its intent is to improve the performance of the host
computer in specific numeric calculation tasks.
SIMD-Array Processor Architecture
 SIMD has two basic configurations:
 Array processors using RAM, also known as
dedicated memory organization
 • ILLIAC-IV, CM-2, MP-1
 Associative processors using content-addressable
memory, also known as global memory organization
 • BSP
MMX
Multi Media Extensions
Development
 MMX (Multimedia Extension) was introduced in
1996 (Pentium with MMX and Pentium II).
 SSE (Streaming SIMD Extensions) was introduced
with Pentium III.
 SSE2 was introduced with Pentium 4.
 SSE3 was introduced with Pentium 4 supporting
hyper-threading technology. SSE3 adds 13 more
instructions.
MMX
 After analyzing many existing applications, such as
graphics, MPEG, music, speech recognition, games, and
image processing, Intel's designers found that many
multimedia algorithms execute the same instructions on
many pieces of data in a large data set.
 Typical elements are small: 8 bits for pixels, 16 bits
for audio, 32 bits for graphics and general computing.
 New data type: 64-bit packed data type. Why 64 bits?
 Good enough
 Practical
Data Types of MMX
The four MMX technology data types are:
 Packed byte -- Eight bytes packed into one 64-bit
quantity.
 Packed word -- Four 16-bit words packed into one
64-bit quantity.
 Packed doubleword -- Two 32-bit double words
packed into one 64-bit quantity.
 Quadword -- One 64-bit quantity.
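The four data types are different views of the same 64 bits. The sketch below packs and unpacks elements using plain integers; the helper names `pack` and `unpack` are hypothetical, and placing element 0 in the least-significant bits is an assumption that follows the usual x86 little-endian convention.

```python
def pack(elements, element_bits):
    """Pack equal-width unsigned elements into one 64-bit quantity."""
    lanes = 64 // element_bits
    if len(elements) != lanes:
        raise ValueError(f"expected {lanes} elements of {element_bits} bits")
    mask = (1 << element_bits) - 1
    value = 0
    for i, element in enumerate(elements):
        if not 0 <= element <= mask:
            raise ValueError("element does not fit in its lane")
        value |= element << (i * element_bits)  # lane 0 is lowest
    return value


def unpack(value, element_bits):
    """Split a 64-bit quantity into unsigned elements of the given width."""
    mask = (1 << element_bits) - 1
    return [(value >> (i * element_bits)) & mask
            for i in range(64 // element_bits)]


quad = pack([1, 2, 3, 4, 5, 6, 7, 8], 8)  # packed byte: 8 x 8 bits
as_words = unpack(quad, 16)               # packed word: 4 x 16 bits
as_dwords = unpack(quad, 32)              # packed doubleword: 2 x 32 bits
as_quad = unpack(quad, 64)                # quadword: 1 x 64 bits
```

The same 64-bit value reads back as eight bytes, four words, two doublewords, or one quadword, depending only on the width chosen.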
Compatibility
 To be fully compatible with the existing Intel architecture
(IA), no new mode or state was created. Hence, for context
switching, no extra state needs to be saved.
 To reach this goal, MMX is hidden behind the FPU. When
floating-point state is saved or restored, MMX state is
saved or restored with it.
 This allows an existing OS to perform context switching on
processes executing MMX instructions without being
aware of MMX.
 However, it means that MMX and the FPU cannot be used at
the same time, and switching between them carries a big
overhead.
 Intel defends its decision to alias MMX onto the FPU on
compatibility grounds, but it was arguably a bad decision: the
OS could simply have been updated or shipped a service pack.
 This is why Intel later introduced SSE without any aliasing.
Saturation Arithmetic
 In an 8-bit grayscale picture, 255 is the value for pure white, and 0 is the
value for pure black. In a regular 8-bit register (AL, BL, CL ...), if we add
one to white, we get black! This is because regular registers "roll over" to
0 when a result exceeds the largest value they can hold. MMX gets around
this with a technique called "saturation arithmetic". In saturation
arithmetic, a value never rolls over: it is clamped to the largest or
smallest representable value instead. This means that in the MMX world
(with unsigned 8-bit values), we have the following equations:
 255 + 100 = 255
 200 + 100 = 255
 0 - 100 = 0
 99 - 100 = 0
 This may seem counter-intuitive at first to people who are used to their
registers rolling over, but it makes sense in some situations: if we try to
make white brighter, it shouldn't become black.
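The equations above can be checked with a minimal sketch of unsigned 8-bit arithmetic. The function names are hypothetical; the code only models the clamping rule and is not an MMX implementation.

```python
def wrap_add_u8(a, b):
    """Ordinary 8-bit addition: the result rolls over past 255."""
    return (a + b) & 0xFF


def sat_add_u8(a, b):
    """Saturating 8-bit addition: the result is clamped at 255."""
    return min(a + b, 255)


def sat_sub_u8(a, b):
    """Saturating 8-bit subtraction: the result is clamped at 0."""
    return max(a - b, 0)


white_plus_one_wrapping = wrap_add_u8(255, 1)    # rolls over to black
white_plus_one_saturating = sat_add_u8(255, 1)   # stays white
```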
MMX Registers
 MMX defines eight registers, called MM0
through MM7, and operations that operate
on them. Each register is 64 bits wide and
can be used to hold either 64-bit integers, or
multiple smaller integers in a "packed"
format: a single instruction can then be
applied to two 32-bit integers, four 16-bit
integers, or eight 8-bit integers at once.
Instructions
 The MMX registers are 64 bits wide, but can be broken down as
follows:
 2 32-bit values
 4 16-bit values
 8 8-bit values
 The MMX registers cannot easily be used for 64-bit arithmetic.
 Let's say that we have 4 bytes loaded in an MMX register:
10, 25, 128, 255. We have them arranged as such:
 MM0: | 10 | 25 | 128 | 255 |
 And we do the following pseudo code operation:
 MM0 + 10
 We would get the following result:
 MM0: | 10+10 | 25+10 | 128+10 | 255+10 | = | 20 | 35 | 138 | 255 |
 Remember that our arithmetic "saturates" in the last box, so the value
doesn't go over 255.
 Using MMX, we are essentially performing 4 additions in the time it
takes to perform 1 addition using the regular registers, using 4 times
fewer instructions.
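The worked example can be reproduced with a lane-by-lane sketch. It loosely models a packed unsigned saturating byte add (the role the PADDUSB instruction plays in MMX); the function name and the list-of-lanes representation are assumptions, and Python evaluates the lanes one after another rather than in a single step.

```python
def packed_add_saturate_u8(lanes, addend_lanes):
    """Add two packed registers lane by lane, clamping each byte at 255."""
    if len(lanes) != len(addend_lanes):
        raise ValueError("both operands need the same number of lanes")
    return [min(a + b, 255) for a, b in zip(lanes, addend_lanes)]


mm0 = [10, 25, 128, 255]
ten_in_every_lane = [10] * len(mm0)  # the "+ 10" operand, one copy per lane
mm0 = packed_add_saturate_u8(mm0, ten_in_every_lane)
```

After the single modeled instruction, `mm0` holds 20, 35, 138 and 255, with the last lane saturated as in the slide.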
MMX Instructions
