Centre for Computer Technology

ICT123 Computer Architecture
Week 12

Parallel Processing

Contents at a Glance
 Review of week 11
 Threads and Processes
 Implicit and Explicit Multithreading
 Clusters
 Parallelizing
 Vector Computation

March 20, 2012

Richard Salomon, Sudipto Mitra Copyright Box Hill Institute

Shared-Memory Architecture: Cache Coherence

 Problem: multiple copies of the same data may reside in several caches and in main memory
 If the processors are allowed to update their own cached copies freely, the result may be an inconsistent view of memory
 This is the cache coherence problem
 Hence the multiple copies of the data (in the caches) have to be kept identical

Cache Coherence Protocols

 The objective is to keep recently used local variables in the cache and let them reside there through numerous reads and writes
 The protocol maintains the consistency of shared variables held in multiple caches at the same time
 Cache coherence approaches:
   Software solutions
   Hardware solutions

 

Directory Protocol

 Collect and maintain information about copies of data in the caches
 A centralized controller (part of the main-memory controller) and a directory stored in main memory
 When a request is made, the centralized controller checks the directory and issues the necessary commands for data transfer between memory and cache, or between caches
 The central controller keeps the state information up to date

Snoopy Protocol

 Cache coherence responsibility is distributed among the cache controllers
 A cache recognizes that a line is shared
 Updates are announced to the other caches by a broadcast mechanism
 Each cache controller snoops on the network to observe the broadcast notifications and react accordingly


MESI Protocol

 Modified: the line in the cache has been modified (differs from main memory) and is present only in this cache
 Exclusive: the line in the cache is the same as in main memory and is not present in any other cache
 Shared: the line in the cache is the same as in main memory and may be present in another cache
 Invalid: the line in the cache does not contain valid data
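The four states and the events that move a line between them can be sketched as a small table-driven state machine. This is a minimal sketch: the event names and the simplification that a read miss always finds another sharer (so I goes to S, never E) are our assumptions, not part of the protocol specification.

```python
# Minimal sketch of the MESI state machine for one cache line, viewed
# from a single cache controller. "local_*" are this processor's
# accesses; "snoop_*" are accesses by other processors observed on the
# bus. Event names are illustrative, not standard signal names.
TRANSITIONS = {
    ("I", "local_read"):  "S",   # simplifying assumption: a sharer exists
    ("I", "local_write"): "M",
    ("S", "local_read"):  "S",
    ("S", "local_write"): "M",   # broadcast invalidate to other caches
    ("E", "local_read"):  "E",
    ("E", "local_write"): "M",   # silent upgrade, no bus traffic needed
    ("M", "local_read"):  "M",
    ("M", "local_write"): "M",
    ("M", "snoop_read"):  "S",   # write the line back, then share it
    ("E", "snoop_read"):  "S",
    ("S", "snoop_read"):  "S",
    ("M", "snoop_write"): "I",   # another cache wants exclusive access
    ("E", "snoop_write"): "I",
    ("S", "snoop_write"): "I",
}

def step(state, event):
    """Return the next MESI state; unlisted combinations keep the state."""
    return TRANSITIONS.get((state, event), state)

state = "I"
for ev in ["local_read", "snoop_write", "local_write", "snoop_read"]:
    state = step(state, ev)
    print(ev, "->", state)
```

Tracing the four events shows the coherence guarantee in action: a write by another processor invalidates our copy, and a read by another processor after our write forces the line back to Shared.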

MESI State Transition Diagram


CPU performance

One of P&H's "big pictures":

CPU Execution Time = Instruction Count × CPI × Clock Cycle Time
(seconds = instructions/program × cycles/instruction × seconds/cycle)

Note: CPI is somewhat artificial (it is computed from the other numbers using this formula), but it is an intuitive and useful concept.
Note: use the dynamic instruction count (the number of instructions executed), not the static count (the number of instructions in the compiled code).
CSE 141 - Performance I and II Copyright Box Hill Institute
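Plugging numbers into the equation shows how the three factors combine. The figures below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Illustrating CPU time = instruction count x CPI x clock cycle time
# (all numbers are made up for illustration, not measurements).
instruction_count = 10_000_000   # dynamic count: instructions actually executed
cpi = 2.5                        # average clock cycles per instruction
clock_rate_hz = 2_000_000_000    # 2 GHz clock -> cycle time = 1 / clock rate

cycle_time_s = 1 / clock_rate_hz
cpu_time_s = instruction_count * cpi * cycle_time_s
print(f"CPU time = {cpu_time_s * 1000:.3f} ms")  # 12.500 ms
```

Halving CPI or doubling the clock rate halves the result, which is exactly the reasoning the equation is meant to support.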

Explaining performance variation

CPU Execution Time = Instruction Count × CPI × Clock Cycle Time

The equation helps explain performance variation across:
 The same machine running different programs
 The same program on different machines with the same ISA
 The same program across different ISAs


How do you judge computer performance?

 Clock speed? No (unless the ISA is the same)
 Peak MIPS rate? No
 Relative MIPS, normalized MFLOPS? Sometimes (if the program tested is like yours)
 How fast does it execute MY program? The best method!

 

What are limits?

 Physics: speed of light, size of atoms, heat generated (speed requires energy loss), capacity of the electromagnetic spectrum (for wireless), ...
 Limits with current technology: size of magnetic domains, chip size (due to defects), lithography, pin count
 New technologies on the horizon: quantum computers, molecular computers, superconductors, optical computers, holographic storage, ...
 Fallacy: improvements will stop
 Pitfall: trying to predict more than 5 years into the future

Centre for Computer Technology

Processor Performance

Increasing Performance (1)

 Processor performance can be measured by the rate at which the processor executes instructions:

  MIPS rate = f × IPC

  where f is the processor clock frequency in MHz and IPC is the average number of instructions executed per cycle
 Increase performance by:
   Increasing the clock frequency
   Increasing the number of instructions that complete during a cycle (pipelining)
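A quick sanity check of the relation, with made-up figures:

```python
# MIPS rate = f * IPC, with f expressed in MHz (hypothetical numbers).
f_mhz = 2000        # a 2 GHz clock is 2000 MHz
ipc = 1.5           # average instructions completed per cycle

mips_rate = f_mhz * ipc
print(f"MIPS rate = {mips_rate:.0f}")  # 3000 (million instructions/second)
```

The two levers named on the slide map directly onto the two factors: a faster clock raises f, and pipelining or multiple issue raises IPC.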


Increasing Performance (2)

 We may be reaching a limit, due to:
   Complexity
   Power consumption
 Alternative approach:
   Divide the instruction stream into smaller streams, called threads
   Execute the threads in parallel (multithreading)
   There is a wide variety of multithreading designs

Threads and Processes (1)

 A thread in a multithreaded processor may or may not be the same as a software thread
 A process is an instance of a program running on a computer
 A process has two main characteristics:
   Resource ownership: a virtual address space to hold the process image (a collection of program, data, stack and other attributes)
   Scheduling/execution (execution state, dispatching priority)
 Process switch: an operation that switches the processor from one process to another

Threads and Processes (2)

 Thread: a dispatchable unit of work within a process
   Includes a processor context (program counter and stack pointer) and a data area for its stack
   A thread executes sequentially
   Interruptible: the processor can turn to another thread
 Thread switch: switching the processor between threads within the same process
   Typically less costly than a process switch
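The distinction is visible directly in software threads: the threads below share their process's address space (the `results` dict), while each remains an independently dispatchable unit of work. A minimal sketch using Python's standard `threading` module:

```python
# Several threads inside one process: they share the process's memory,
# so writes to the same dict need a lock for safe access.
import threading

results = {}              # shared data: every thread sees the same dict
lock = threading.Lock()

def worker(name, n):
    total = sum(range(n))        # each thread executes sequentially
    with lock:                   # synchronize access to shared state
        results[name] = total

threads = [threading.Thread(target=worker, args=(f"t{i}", 1000))
           for i in range(4)]
for t in threads:
    t.start()                    # dispatch each thread
for t in threads:
    t.join()                     # wait for all threads to finish

print(results)   # all four threads wrote into the shared address space
```

No process switch was needed to run the four units of work, which is why switching between threads of one process is typically the cheaper operation.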

Explicit Multithreading

 All commercial processors, and most experimental ones, use explicit multithreading
 Instructions from different explicit threads are executed concurrently, either by:
   Interleaving instructions from different threads on shared pipelines, or
   Parallel execution on parallel pipelines


Implicit Multithreading

 Implicit multithreading is the concurrent execution of multiple threads extracted from a single sequential program
 Implicit threads are defined:
   Statically, by the compiler, or
   Dynamically, by the hardware


Approaches to Explicit Multithreading (1)

 Interleaved (also known as fine-grained multithreading):
   The processor deals with two or more thread contexts at a time
   Threads are switched at each clock cycle
   If a thread is blocked, it is skipped and a ready thread is executed


Approaches to Explicit Multithreading (2)

 Blocked (also known as coarse-grained multithreading):
   Threads are executed successively until an event occurs that causes a delay, e.g. a cache miss
   Effective on an in-order processor
   Avoids pipeline stalls


Approaches to Explicit Multithreading (3)

 (a)-(c) Three threads (empty boxes indicate that the thread has stalled waiting for memory)
 (d) Fine-grained multithreading
 (e) Coarse-grained multithreading

(Tanenbaum, Structured Computer Organization, Fifth Edition, (c) 2006 Pearson Education)

Approaches to Explicit Multithreading (4)

Multithreading with a dual-issue superscalar CPU:
 (a) Fine-grained multithreading
 (b) Coarse-grained multithreading
 (c) Simultaneous multithreading

(Tanenbaum, Structured Computer Organization, Fifth Edition, (c) 2006 Pearson Education)

Approaches to Explicit Multithreading (5)

 Simultaneous multithreading (SMT):
   Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor
 Chip multiprocessing:
   The processor is replicated on a single chip
   Each processor handles separate threads


Scalar Processor Approaches (1)

 Single-threaded scalar:
   Simple pipeline
   No multithreading
 Interleaved multithreaded scalar:
   Easiest multithreading to implement
   Threads are switched at each clock cycle
   Pipeline stages are kept close to fully occupied
   Hardware needs to switch the thread context between cycles

Scalar Processor Approaches (2)

 Blocked multithreaded scalar:
   A thread is executed until a latency event occurs that would stop the pipeline
   The processor then switches to another thread


Scalar Diagrams


Multiple Instruction Issue Processors (1)

 Superscalar:
   No multithreading
 Interleaved multithreading superscalar:
   Each cycle, as many instructions as possible are issued from a single thread
   Delays due to thread switches are eliminated
   The number of instructions issued in a cycle is limited by dependencies

Multiple Instruction Issue Processors (2)


Multiple Instruction Issue Processors (3)

 Blocked multithreaded superscalar:
   Instructions are issued from one thread at a time
   Blocked multithreading is used
 Very long instruction word (VLIW), e.g. IA-64:
   Multiple instructions in a single word
   Typically constructed by the compiler
   Operations that may be executed in parallel are placed in the same word
   May pad with no-ops

Multiple Instruction Issue Processors (4)

 Interleaved multithreading VLIW:
   Similar efficiencies to interleaved multithreading on a superscalar architecture
 Blocked multithreaded VLIW:
   Similar efficiencies to blocked multithreading on a superscalar architecture

Multiple Instruction Issue Processors (5)


Parallel, Simultaneous Execution of Multiple Threads (1)

 Simultaneous multithreading:
   Multiple instructions are issued at a time
   One thread may fill all horizontal slots
   Instructions from two or more threads may be issued
   With enough threads, the maximum number of instructions can be issued on every cycle


Parallel, Simultaneous Execution of Multiple Threads (2)

 Chip multiprocessor:
   Multiple processors on one chip
   Each is a two-issue superscalar processor
   Each processor is assigned a thread
   Can issue up to two instructions per cycle per thread

Parallel, Simultaneous Execution of Multiple Threads (3)


Examples

 Some Pentium 4 processors:
   Intel calls it hyper-threading
   SMT with support for two threads
   A single multithreaded processor that appears, logically, as two processors
 IBM Power5 (high-end PowerPC):
   Combines chip multiprocessing with SMT
   The chip has two separate processors
   Each supports two threads concurrently using SMT

Hyperthreading on the Pentium 4

Resource sharing between threads in the Pentium 4 NetBurst microarchitecture
(Tanenbaum, Structured Computer Organization, Fifth Edition, (c) 2006 Pearson Education)

Power5 Instruction Data Flow


Clusters (1)

 An alternative to SMP
 High performance
 High availability
 Server applications


Clusters (2)

 A group of interconnected whole computers
 Working together as a unified resource
 Gives the illusion of being one machine
 Each computer is called a node


Cluster Benefits

 Absolute scalability
 Incremental scalability
 High availability
 Superior price/performance


Cluster Configurations - Standby Server, No Shared Disk


Cluster Configurations - Shared Disk


Operating Systems Design Issues (1)

 Failure management:
   High availability
   Fault tolerance
 Failover: switching applications & data from the failed system to an alternative within the cluster
 Failback: restoration of applications and data to the original system after the problem is fixed

Operating Systems Design Issues (2)

 Load balancing:
   Incremental scalability
   Automatically include new computers in scheduling
   Middleware needs to recognise that processes may switch between machines


Parallelizing (1)

 A single application executing in parallel on a number of machines in a cluster
 Compiler approach:
   The compiler determines at compile time which parts can be executed in parallel
   These parts are split off for different computers

Parallelizing (2)

 Application approach:
   The application is written from scratch to be parallel
   Message passing is used to move data between nodes
   Hard to program, but gives the best end result
 Parametric computing:
   Used when a problem consists of repeatedly executing the same algorithm on different sets of data, e.g. a simulation using different scenarios
   Needs effective tools to organize and run the jobs

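The parametric-computing pattern can be sketched locally: the same algorithm is run over many parameter sets in parallel. On a cluster each set would be dispatched to a different node; here a process pool stands in for the nodes, and the `simulate` function and its scenarios are invented for illustration.

```python
# Local sketch of parametric computing: one algorithm, many parameter
# sets, executed in parallel. A process pool stands in for cluster nodes.
from concurrent.futures import ProcessPoolExecutor

def simulate(scenario):
    """Toy 'simulation': compound growth, deterministic in its parameters."""
    rate, steps = scenario
    value = 1.0
    for _ in range(steps):
        value *= (1 + rate)
    return round(value, 4)

# Different scenarios = different parameter sets for the same algorithm.
scenarios = [(0.01, 10), (0.02, 10), (0.05, 10)]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(simulate, scenarios))  # fan out, gather
    print(results)
```

The appeal is exactly what the slide notes: no inter-node communication is needed during the runs, so the work parallelizes trivially, and the tooling problem is organizing and collecting the jobs.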

Cluster Computer Architecture


Cluster v. SMP (1)

 Both provide multiprocessor support to high-demand applications
 Both are available commercially (SMP has been available for longer)


Cluster v. SMP (2)

 SMP:
   Easier to manage and control
   Closer to single-processor systems (scheduling is the main difference)
   Less physical space
   Lower power consumption
 Clustering:
   Superior incremental & absolute scalability
   Superior availability (redundancy)

Non-uniform Memory Access (NUMA) (1)

 An alternative to SMP & clustering
 Uniform memory access (UMA) systems:
   All processors have access to all parts of memory, using load & store
   The access time to all regions of memory is the same
   The access time to memory is the same for all processors
   As used by SMP
 Non-uniform memory access (NUMA) systems:
   All processors have access to all parts of memory, using load & store
   A processor's access time differs depending on the region of memory
   Different processors access different regions of memory at different speeds

Non-uniform Memory Access (NUMA) (2)

 Cache-coherent NUMA (CC-NUMA):
   Cache coherence is maintained among the caches of the various processors
   Significantly different from SMP and clusters


Motivation

 SMP has a practical limit to the number of processors:
   Bus traffic limits it to between 16 and 64 processors
 In a cluster, each node has its own memory:
   Applications do not see a large global memory
   Coherence is maintained by software, not hardware
 NUMA retains the SMP flavour while giving large-scale multiprocessing
 The objective is to maintain a transparent system-wide memory while permitting multiprocessor nodes, each with its own bus or internal interconnection system

CC-NUMA Operation (1)

 Each processor has its own L1 and L2 cache
 Each node has its own main memory
 Nodes are connected by some networking facility
 Each processor sees a single addressable memory space
 Memory request order:
  1. L1 cache (local to the processor)
  2. L2 cache (local to the processor)
  3. Main memory (local to the node)
  4. Remote memory


CC-NUMA Operation (2)

 Memory requests are delivered to the requesting processor's local cache
 Automatic and transparent
 Each node maintains a directory of the location of portions of memory, and of cache status


CC-NUMA Organization


Memory Access Sequence Example

Suppose node 2, processor 3 (P2-3) requests location 798, which is in the memory of node 1:

1. P2-3 issues a read request on the snoopy bus of node 2
2. The directory on node 2 recognises that the location is on node 1
3. Node 2's directory sends a request to node 1's directory
4. Node 1's directory requests the contents of location 798
5. Node 1's memory puts the data on the (node 1 local) bus
6. Node 1's directory gets the data from the (node 1 local) bus
7. The data is transferred to node 2's directory
8. Node 2's directory puts the data on the (node 2 local) bus
9. The data is picked up, put in P2-3's cache and delivered to the processor

Cache Coherence

 Node 1's directory keeps a note that node 2 has a copy of the data
 If the data is modified in a cache, this is broadcast to the other nodes
 Local directories monitor these broadcasts and purge their local cache if necessary
 A local directory monitors changes to local data held in remote caches and marks the memory invalid until write-back
 A local directory forces a write-back if a memory location is requested by another processor

NUMA Pros & Cons (1)

 Higher effective levels of parallelism than SMP
 No major software changes needed
 Performance can break down if there is too much access to remote memory
 This can be avoided by:
   L1 & L2 cache design that reduces all memory access
   Good temporal locality in the software
   Good spatial locality in the software
   Virtual memory management moving pages to the nodes that are using them most

NUMA Pros & Cons (2)

 Not transparent:
   Page allocation, process allocation and load balancing changes are needed
 Availability?


Vector Computation

 Maths problems involving physical processes present particular difficulties for computation:
   High precision
   Repeated floating-point calculations on large arrays of numbers
 Examples: aerodynamics, seismology, meteorology, continuous-field simulation


Vector Computation Handling

 Supercomputers:
   Hundreds of millions of flops
   $10-15 million
   Optimised for calculation rather than multitasking and I/O
   Limited market: research, government agencies, meteorology
 Array processors:
   An alternative to the supercomputer
   Configured as peripherals to mainframes & minis
   Run just the vector portion of problems

Vector Addition Example


Approaches (1)

 General-purpose computers rely on iteration to do vector calculations
 In the vector addition example above this needs six calculations
 Vector processing:
   Assumes it is possible to operate on a one-dimensional vector of data
   All elements in a particular row can be calculated in parallel
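The difference between the two approaches can be written out for a six-element vector addition. This is plain Python as a sketch: a real vector processor would issue the whole-vector form as a single instruction rather than a loop.

```python
# Two ways to add the vectors a and b, element by element.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]

# Iterative (general-purpose style): one scalar addition per step,
# so six separate calculations for six elements.
c_iter = []
for i in range(len(a)):
    c_iter.append(a[i] + b[i])

# Vector style: express the operation on the whole vector at once;
# a vector processor could perform all six additions in parallel.
c_vec = [x + y for x, y in zip(a, b)]

assert c_iter == c_vec == [7.0] * 6
```

The results are identical; what changes is how many instructions the processor must fetch and issue, and whether the element operations can overlap in time.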


Approaches (2)

 Parallel processing:
   Independent processors functioning in parallel
   Use FORK N to start an individual process at location N
   JOIN N causes N independent processes to join and merge following the JOIN
   The O/S co-ordinates the JOINs
   Execution is blocked until all N processes have reached the JOIN
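The FORK/JOIN semantics map naturally onto thread start/join: `start()` plays the role of FORK, and `join()` blocks until each worker has finished, like the O/S-coordinated JOIN. A minimal sketch (the partitioning scheme and names are ours, not from the slide's notation):

```python
# FORK/JOIN sketch: fork N workers over slices of a vector, then join.
import threading

N = 4
partial = [0] * N
data = list(range(100))

def work(k):
    # each "forked" worker handles one interleaved slice of the data
    chunk = data[k::N]
    partial[k] = sum(chunk)

workers = [threading.Thread(target=work, args=(k,)) for k in range(N)]
for w in workers:
    w.start()        # FORK: start each independent stream of work
for w in workers:
    w.join()         # JOIN: block until all N workers have arrived

total = sum(partial)
print(total)  # 4950, the same answer a single sequential pass gives
```

The key property the slide names is the barrier behaviour: the code after the joins cannot run until every one of the N workers has reached its JOIN point.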

Processor Designs

 Pipelined ALU:
   Within operations
   Across operations
 Parallel ALUs
 Parallel processors


Approaches to Vector Computation


Computer Organizations


IBM 3090 with Vector Facility


Summary

 Processor performance can be measured by the rate at which the processor executes instructions: MIPS rate = f × IPC
 An alternative approach to improving performance is to divide the instruction stream into smaller streams, called threads, and execute them in parallel (multithreading)
 Clustering is an alternative to symmetric multiprocessing for providing high performance and high availability

Reference

 Stallings, William, 2003, Computer Organization & Architecture: Designing for Performance, Sixth Edition, Pearson Education, ISBN 0-13-049307-4
 Mano, M. Morris, Computer System Architecture, Third Edition, Prentice Hall
 Carter, Larry, Measuring Performance, UCSD CSE 141, Winter 2002
 Tanenbaum, Structured Computer Organization, Fifth Edition, 2006, Pearson Education, ISBN 0-13-148521-0
 Thornley, John, CS 284a Lecture, Tuesday, 7 October 1997

Further Reading

 Manufacturers' websites
 Relevant Special Interest Groups [SIG]
 Articles in magazines
 IEEE Computer Society Task Force on Cluster Computing web-site
