SIMD Presentation

SIMD
Single instruction, multiple data (SIMD)
Contents
• Parallel Processors
• Flynn's taxonomy
• What is SIMD?
• Types of Processing
• Scalar Processing
• Vector Processing
• Architecture for Vector Processing
• Vector processors
• Vector Processor Architectures
• Components of Vector Processors
• Advantages of Vector Processing
• Array processors
• Array Processor Classification
• Array Processor Architecture
• Dedicated Memory Organization
• Global Memory Organization
• ILLIAC IV
• ILLIAC IV Architecture
• Super Computers
• Cray X1
• Multimedia Extension
Parallel Processors
• In computers, parallel processing is the execution of a program's instructions by dividing them among multiple processors, with the objective of running the program in less time.
• In the earliest computers, only one program ran at a time. A computation-intensive program that took one hour to run and a tape-copying program that took one hour to run would take a total of two hours to run. An early form of parallel processing allowed the interleaved execution of both programs.
• The computer would start an I/O operation and, while waiting for that operation to complete, execute the processor-intensive program. The total execution time for the two jobs would be a little over one hour.
Flynn's taxonomy
• Flynn's taxonomy is a classification of computer architectures proposed by Michael J. Flynn in 1966. The classification system has stuck and has been used as a tool in the design of modern processors and their functionalities.
Classification
• The four classifications defined by Flynn are based upon the number of concurrent instruction (or control) streams and data streams available in the architecture.
• Single instruction stream, single data stream (SISD)
• Single instruction stream, multiple data streams (SIMD)
• Multiple instruction streams, single data stream (MISD)
• Multiple instruction streams, multiple data streams (MIMD)
• Single instruction, multiple threads (SIMT) is not one of Flynn's four classes; it is a later execution model usually treated as a subcategory of SIMD.
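As a rough illustration, Flynn's classes can be read off from two counts: the number of concurrent instruction streams and the number of concurrent data streams. The function name `flynn_class` and its parameters are made up for this sketch and do not come from the slides.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn class name for the given numbers of concurrent streams.

    Hypothetical helper for illustration only.
    """
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    # "S" for a single stream, "M" for multiple streams.
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return f"{instr}I{data}D"


if __name__ == "__main__":
    print(flynn_class(1, 1))  # one instruction stream, one data stream
    print(flynn_class(1, 8))  # one instruction stream, eight data streams
```

A machine that applies one instruction stream to eight data streams falls in the SIMD class, which is the subject of the rest of these slides.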
Evolution of Intel Vector Instructions
■ MMX (1996, Pentium)
  • CPU-based MPEG decoding
  • Integers only; 64-bit registers divided into 2 x 32-bit, 4 x 16-bit, or 8 x 8-bit elements
  • Phased out with SSE4
■ SSE (1999, Pentium III)
  • CPU-based 3D graphics
  • 4-way float operations, single precision
  • 8 new 128-bit registers, 70 new instructions
■ SSE2 (2000, Pentium 4)
  • High-performance computing
  • Adds 2-way float ops, double precision; same registers as 4-way single precision
  • Integer SSE instructions make MMX obsolete
■ SSE3 (2004, Pentium 4E Prescott)
  • Scientific computing
  • New 2-way and 4-way vector instructions for complex arithmetic
■ SSSE3 (2006, Core 2 Duo)
  • Minor advancement over SSE3
■ SSE4 (2007, Core 2 Duo Penryn)
  • Modern codecs, cryptography
  • New integer instructions
  • Better support for unaligned data, super shuffle engine
What is SIMD?
• Single instruction, multiple data (SIMD) is a class of parallel computers in Flynn's taxonomy.
• It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. Such machines therefore exploit data-level parallelism.
• The computations are simultaneous (parallel), but only a single process (instruction) is active at any given moment.
How does SIMD process data?
Processing/Working
Types of Processing
• Scalar Processing
  • A scalar processor is a CPU that performs computations on one number or set of data at a time. It is known as a "single instruction stream, single data stream" (SISD) CPU.
• Vector Processing
  • A vector processor or array processor is a central processing unit (CPU) that implements an instruction set containing instructions that operate on 1-D arrays of data called vectors.
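The difference between the two styles can be sketched by counting how many add "instructions" each one issues for the same data. This is a plain-Python model, not real vector hardware; the function names and the `vector_length` of 4 are assumptions chosen for illustration, and both inputs are assumed to have the same length.

```python
def scalar_add(a, b):
    """Scalar style: one add 'instruction' is issued per pair of elements."""
    result = []
    instructions_issued = 0
    for x, y in zip(a, b):
        result.append(x + y)
        instructions_issued += 1
    return result, instructions_issued


def vector_add(a, b, vector_length=4):
    """Vector style: one add 'instruction' covers up to vector_length elements."""
    result = []
    instructions_issued = 0
    for start in range(0, len(a), vector_length):
        chunk_a = a[start:start + vector_length]
        chunk_b = b[start:start + vector_length]
        # In hardware these element-wise adds would happen together.
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions_issued += 1
    return result, instructions_issued
```

For eight elements, the scalar version issues eight adds while the vector version issues two, and both produce the same sums.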
Architecture for Vector Processing
• Two architectures suitable for vector processing are:
  • Pipelined vector processors
  • Parallel array processors
Pipelined vector processors
• A CPU that implements an instruction set that operates on 1-D arrays, called vectors
• Vectors contain multiple data elements
• The number of data elements per vector is typically referred to as the vector length
• Both instructions and data are pipelined to reduce decoding time
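To see why pipelining the data helps, here is a small cycle-count sketch. It assumes an idealized pipeline in which every stage takes one cycle and there are no stalls; the function names, the 4-stage depth, and the 64-element vector in the usage note are illustrative choices, not figures from the slides.

```python
def unpipelined_cycles(n_elements: int, n_stages: int) -> int:
    """Each element passes through all stages before the next one starts."""
    return n_elements * n_stages


def pipelined_cycles(n_elements: int, n_stages: int) -> int:
    """Ideal pipeline: the first result appears after n_stages cycles,
    then one further result completes every cycle."""
    if n_elements == 0:
        return 0
    return n_stages + (n_elements - 1)
```

Under these assumptions, a 64-element vector on a 4-stage unit needs 256 cycles without pipelining and 67 cycles with it, so the cost per element approaches one cycle as the vector length grows.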
Advantages of Vector Processing
• Quick fetch and decode of a single instruction for multiple operations
• The instruction provides a regular source of data, which arrives at each cycle and can be processed efficiently in a pipelined fashion
• Easier addressing of main memory
• Elimination of memory wastage
• Simplification of control hazards
• Reduced code size
Array Processors
• An array processor is a processor that performs computations on a large array of data.
• It is a synchronous parallel computer with multiple ALUs, called processing elements (PEs), that can operate in parallel in lockstep fashion.
• It is composed of N identical PEs under the control of a single control unit, plus a number of memory modules.
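The lockstep idea can be sketched as a toy model: one control unit broadcasts each instruction to every processing element, and each PE applies it to its own memory module. The class names, the three-instruction set (`LOAD`, `ADD`, `STORE`), and the memory layout are all invented for this sketch. Python runs the PEs one after another, whereas real hardware would run them simultaneously.

```python
class ProcessingElement:
    """One PE: an accumulator plus its own memory module (hypothetical model)."""

    def __init__(self, local_memory):
        self.memory = list(local_memory)
        self.acc = 0

    def execute(self, op, address):
        if op == "LOAD":
            self.acc = self.memory[address]
        elif op == "ADD":
            self.acc += self.memory[address]
        elif op == "STORE":
            self.memory[address] = self.acc
        else:
            raise ValueError(f"unknown operation: {op}")


class ControlUnit:
    """Single control unit that drives N identical PEs in lockstep."""

    def __init__(self, pes):
        self.pes = pes

    def run(self, program):
        for op, address in program:
            # The same instruction goes to every PE before the next one is issued.
            for pe in self.pes:
                pe.execute(op, address)


if __name__ == "__main__":
    # Each PE holds one element of a, one element of b, and a result slot.
    pes = [ProcessingElement([a, b, 0]) for a, b in [(1, 10), (2, 20), (3, 30), (4, 40)]]
    ControlUnit(pes).run([("LOAD", 0), ("ADD", 1), ("STORE", 2)])
    print([pe.memory[2] for pe in pes])
```

Three broadcast instructions add four pairs of numbers, one pair per PE, regardless of how many PEs there are.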
Array Processor Classification
SIMD (Single Instruction, Multiple Data)
is an array processor that has a single-instruction, multiple-data organization.
It manipulates vector instructions by means of multiple functional units responding to a common instruction.
Attached array processor
is an auxiliary processor attached to a general-purpose computer.
Its intent is to improve the performance of the host computer in specific numeric calculation tasks.
SIMD-Array Processor Architecture
• SIMD has two basic configurations:
• Array processors using RAM, also known as dedicated memory organization
  • ILLIAC-IV, CM-2, MP-1
• Associative processors using content-addressable memory, also known as global memory organization
  • BSP
MMX
Multimedia Extensions
Development
• MMX (Multimedia Extension) was introduced in 1996 (Pentium with MMX and Pentium II).
• SSE (Streaming SIMD Extensions) was introduced with the Pentium III.
• SSE2 was introduced with the Pentium 4.
• SSE3 was introduced with the Pentium 4 supporting Hyper-Threading Technology. SSE3 adds 13 more instructions.
MMX
• After analyzing many existing applications such as graphics, MPEG, music, speech recognition, games, and image processing, Intel's designers found that many multimedia algorithms execute the same instructions on many pieces of data in a large data set.
• Typical elements are small: 8 bits for pixels, 16 bits for audio, 32 bits for graphics and general computing.
• New data type: the 64-bit packed data type. Why 64 bits?
  • Good enough
  • Practical
Data Types of MMX
The four MMX technology data types are:
• Packed byte -- Eight bytes packed into one 64-bit quantity.
• Packed word -- Four 16-bit words packed into one 64-bit quantity.
• Packed doubleword -- Two 32-bit doublewords packed into one 64-bit quantity.
• Quadword -- One 64-bit quantity.
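The four data types are four ways of interpreting the same 64 bits. The sketch below uses Python's standard `struct` module to reinterpret one 8-byte buffer each way. The function name `views_of_quadword` is hypothetical, and little-endian byte order is assumed because x86 is a little-endian architecture.

```python
import struct


def views_of_quadword(raw: bytes) -> dict:
    """Interpret the same 8 bytes as each of the four MMX data types.

    Illustrative helper; '<' selects little-endian, standard-size fields.
    """
    if len(raw) != 8:
        raise ValueError("an MMX register holds exactly 8 bytes")
    return {
        "packed_bytes": struct.unpack("<8B", raw),        # eight 8-bit values
        "packed_words": struct.unpack("<4H", raw),        # four 16-bit values
        "packed_doublewords": struct.unpack("<2I", raw),  # two 32-bit values
        "quadword": struct.unpack("<Q", raw)[0],          # one 64-bit value
    }


if __name__ == "__main__":
    for name, value in views_of_quadword(bytes(range(1, 9))).items():
        print(name, value)
```

No data is moved or converted between the views; only the element boundaries that an instruction assumes are different.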
Compatibility
• To be fully compatible with the existing Intel Architecture (IA), no new mode or state was created. Hence, no extra state needs to be saved for context switching.
• To reach that goal, MMX is hidden behind the FPU: the MMX registers are aliased onto the floating-point registers. When the floating-point state is saved or restored, the MMX state is saved or restored with it.
• This allows an existing OS to context-switch processes that execute MMX instructions without being aware of MMX.
• However, it means MMX and the FPU cannot be used at the same time, and switching between them carries a big overhead.
  • Intel defended its decision to alias MMX onto the FPU on compatibility grounds, but it was arguably a bad decision: an OS can simply provide a service pack or be updated.
  • This is why Intel later introduced SSE without any aliasing.
Saturation Arithmetic
• In an 8-bit grayscale picture, 255 is the value for pure white and 0 is the value for pure black. In a regular 8-bit register (AL, BL, CL, ...), if we add one to white, we get black! This is because regular registers "roll over" (wrap around) to 0 when they overflow. MMX gets around this with a technique called "saturation arithmetic": a result that would overflow is clamped to the largest representable value, and a result that would underflow is clamped to the smallest, so the value never rolls over. For unsigned bytes, this means that in the MMX world we have the following equations:
  • 255 + 100 = 255
  • 200 + 100 = 255
  • 0 - 100 = 0
  • 99 - 100 = 0
• This may seem counter-intuitive at first to people who are used to their registers rolling over, but it makes sense in some situations: if we try to make white brighter, it shouldn't become black.
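The contrast between rolling over and saturating can be written out for unsigned 8-bit values. This is a plain-Python sketch of the arithmetic only; the function names are made up for illustration and are not MMX instruction names.

```python
U8_MAX = 255  # largest value an unsigned 8-bit element can hold


def add_wraparound_u8(a: int, b: int) -> int:
    """Ordinary 8-bit addition: the result rolls over past 255."""
    return (a + b) & 0xFF


def add_saturate_u8(a: int, b: int) -> int:
    """Saturating addition: results above 255 are clamped to 255."""
    return min(a + b, U8_MAX)


def sub_saturate_u8(a: int, b: int) -> int:
    """Saturating subtraction: results below 0 are clamped to 0."""
    return max(a - b, 0)


if __name__ == "__main__":
    print(add_wraparound_u8(255, 1))   # white plus one with rollover
    print(add_saturate_u8(255, 100))   # white made brighter with saturation
    print(sub_saturate_u8(99, 100))    # near-black made darker with saturation
```

With rollover, adding one to white (255) gives black (0); with saturation, white stays white and black stays black.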
MMX Registers
• MMX defines eight registers, called MM0 through MM7, and operations that operate on them. Each register is 64 bits wide and can hold either one 64-bit integer or multiple smaller integers in a "packed" format: a single instruction can then be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers at once.
Instructions
• The MMX registers are 64 bits wide, but can be broken down as follows:
  • 2 32-bit values
  • 4 16-bit values
  • 8 8-bit values
• The MMX registers cannot easily be used for 64-bit arithmetic.
• Let's say that we have 4 bytes loaded in an MMX register: 10, 25, 128, 255. We have them arranged as such:
  MM0: | 10 | 25 | 128 | 255 |
• And we do the following pseudo-code operation (add 10 to every element):
  MM0 + 10
• We would get the following result:
  MM0: | 10+10 | 25+10 | 128+10 | 255+10 | = | 20 | 35 | 138 | 255 |
• Remember that our arithmetic "saturates" in the last box, so the value doesn't go over 255.
• Using MMX, we are essentially performing 4 additions in the time it takes to perform 1 addition using the regular registers, using 4 times fewer instructions.
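The packed addition above can be simulated on a single 64-bit integer. The sketch is loosely modeled on the behavior of the MMX `PADDUSB` instruction (packed add of unsigned bytes with saturation), but it is only a Python model: the helper names are hypothetical, lane 0 is assumed to sit in the lowest byte, and the four values from the slide are padded with zeros to fill all eight byte lanes.

```python
def pack_bytes(values):
    """Pack eight values in 0..255 into one 64-bit integer, lane 0 in the low byte."""
    if len(values) != 8 or any(not 0 <= v <= 255 for v in values):
        raise ValueError("need exactly eight values in the range 0..255")
    packed = 0
    for lane, value in enumerate(values):
        packed |= value << (8 * lane)
    return packed


def unpack_bytes(packed):
    """Split a 64-bit integer back into its eight byte lanes."""
    return [(packed >> (8 * lane)) & 0xFF for lane in range(8)]


def packed_add_saturate_u8(a, b):
    """Add two packed registers lane by lane, clamping each lane at 255."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        result |= min(x + y, 255) << shift  # no carry leaks into the next lane
    return result


if __name__ == "__main__":
    mm0 = pack_bytes([10, 25, 128, 255, 0, 0, 0, 0])
    tens = pack_bytes([10] * 8)
    print(unpack_bytes(packed_add_saturate_u8(mm0, tens)))
```

The Python loop visits the lanes one at a time; the point of the hardware instruction is that all lanes are handled by one instruction, with each lane saturating independently.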
MMX Instructions
