
Parallel Computing Landscape
(CS 526)

Muhammad Awais
Department of Computer Science,
The University of Lahore

Cache Architecture
The cache coherence problem
• Since each core has private caches: how do we keep the data consistent across caches?

• Each core should perceive the memory as a monolithic array, shared by all the cores
The cache coherence problem
Suppose variable x initially contains 15213.

[Diagram: a multi-core chip with Cores 1–4, each with one or more levels of private cache; main memory holds x = 15213]
The cache coherence problem
Core 1 reads x.

[Diagram: Core 1's cache now holds x = 15213; main memory still holds x = 15213]
The cache coherence problem
Core 2 reads x.

[Diagram: Core 1's and Core 2's caches each hold x = 15213; main memory holds x = 15213]
The cache coherence problem
Core 1 writes to x, setting it to 21660.

[Diagram: Core 1's cache holds x = 21660; Core 2's cache still holds x = 15213; main memory still holds x = 15213]
The cache coherence problem
Core 2 attempts to read x… and gets a stale copy.

This problem has many solutions: cache coherence protocols.

[Diagram: Core 1's cache holds x = 21660 while Core 2's cache and main memory still hold x = 15213]
In the MESI protocol:
• Modified (M): The cache line is only present in the current cache and
has been modified. It needs to be written back to main memory.
• Exclusive (E): The cache line is only present in the current cache and
has not been modified.
• Shared (S): The cache line is present in multiple caches, and it's not
modified.
• Invalid (I): The cache line is invalid or not present in the current
cache.
Modified (M):
• This state occurs when a processor has a cache line that has been
modified, meaning it's different from the corresponding data in the
main memory.
• When a processor writes to a cache line in the M state, it must
eventually write it back to main memory to keep it coherent.
• Other caches cannot hold a valid copy of this line (no E or S copies); any copy they have must be Invalid, since it would be stale.
Exclusive (E):
• This state occurs when a cache line is present in only one cache, and
it hasn't been modified.
• In this state, the cache has the most up-to-date copy of the data
compared to other caches.
• The data can be safely read or modified because it's not being used
elsewhere.
Shared (S):

• This state occurs when multiple caches have a copy of the same data,
and none of them have modified it.
• The data is consistent among all caches in the shared state
Invalid (I):
• If a cache line is marked as invalid, it means that it must be fetched
from main memory or another cache before it can be used.
Some scenarios to see how the MESI protocol works:
• Read Miss:
• If a processor tries to read a memory location and it's not in the cache
(I state), it must fetch it from main memory. The line enters the E state
if no other cache holds it, or the S state if other caches hold the same
data.
• Write Operation:
• If a processor wants to write to a memory location, it first checks its
cache. If the data is in the E or M state, the write can proceed directly.
If it's in the S state, it must first issue an invalidate request to the
other caches holding the data.
• Cache-to-Cache Transfer:
• If a cache wants to read a memory location that's already in another
cache (E or M state), it sends a request to the owning cache. The
owning cache responds by sending the data; both the owning cache and
the requesting cache end up in the S state.
• Cache Eviction:
• When a cache line is evicted (replaced) from a cache, it might need to
be written back to main memory if it's in the M state.
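The scenarios above can be condensed into a toy transition function for a single cache line. This is only a sketch of one cache's state, not a full protocol implementation; all names (mesi_state, on_local_read, and so on) are illustrative, not a real API.

```c
/* One cache's MESI state for a single cache line. */
typedef enum { I, S, E, M } mesi_state;

/* Read miss/hit: on a miss (I), fetch the line; it becomes E if no
 * other cache holds it, S otherwise. Reads hit locally in M, E, S. */
mesi_state on_local_read(mesi_state self, int others_have_copy) {
    if (self == I)
        return others_have_copy ? S : E;
    return self;
}

/* Write: in S the cache must first broadcast an invalidate (the other
 * caches run on_snoop_invalidate); every successful write ends in M. */
mesi_state on_local_write(mesi_state self) {
    (void)self;
    return M;
}

/* Another cache reads the line: an M (or E) holder supplies the data
 * (cache-to-cache transfer) and demotes itself to S. */
mesi_state on_snoop_read(mesi_state self) {
    return (self == M || self == E) ? S : self;
}

/* Another cache is about to write: our copy becomes stale. */
mesi_state on_snoop_invalidate(mesi_state self) {
    (void)self;
    return I;
}

/* Eviction: only an M line must be written back to main memory. */
int needs_writeback_on_evict(mesi_state self) {
    return self == M;
}
```

Replaying the earlier slides' story with this sketch: Core 1 reads x (I → E); Core 2 reads x (Core 1 snoops the read and demotes E → S, Core 2 ends in S); Core 1 writes x (Core 2 is invalidated S → I, Core 1 moves to M); evicting Core 1's line now forces a write-back.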
Other Protocols
• MSI protocol (Modified, Shared, Invalid)
• MOSI protocol (Modified, Owned, Shared, Invalid)
• MESI protocol (Modified, Exclusive, Shared, Invalid)
• MOESI protocol (Modified, Owned, Exclusive, Shared, Invalid)
Programming for Multi-core
• Programmers must use threads or processes

• Spread the workload across multiple cores

• Write parallel algorithms

• OS will map threads/processes to cores


Thread safety is very important
• Pre-emptive context switching: a context switch can happen AT ANY TIME

• True concurrency, not just uniprocessor time-slicing

• Concurrency bugs are exposed much faster with multi-core
Assigning threads to the cores
• Each thread/process has an affinity mask

• The affinity mask specifies which cores the thread is allowed to run on

• Different threads can have different masks

• Affinities are inherited across fork()

Affinity masks are bit vectors
• Example: 4-way multi-core, without SMT

  core 3   core 2   core 1   core 0
    1        1        0        1

• The process/thread is allowed to run on cores 0, 2, and 3, but not on core 1
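That bit-vector reading can be checked with plain bit operations (a sketch; the real kernel type, cpu_set_t, appears later in these slides, and allowed_on is an illustrative helper, not a system call):

```c
/* Affinity mask from the example: binary 1101 = cores 0, 2, 3 allowed.
 * Bit k set means the thread may run on core k. */
unsigned example_mask(void) {
    return 0xDu;               /* 1101 in binary */
}

/* Test bit 'core' of the mask. */
int allowed_on(unsigned mask, int core) {
    return (mask >> core) & 1u;
}
```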
Affinity masks when multi-core and SMT are combined

  core 3    core 2    core 1    core 0
  t1  t0    t1  t0    t1  t0    t1  t0
   1   1     0   0     1   0     1   1

• Core 2 can't run the process

• Core 1 can use only one of its two simultaneous threads
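The combined mask can be sketched the same way, with one bit per hardware thread. The bit layout below (hardware thread t of core c at position c*2 + t) is an assumption made purely for illustration; real logical-CPU numbering is OS-specific.

```c
/* The 8-bit mask from the slide:
 *   core 3: t1=1 t0=1, core 2: t1=0 t0=0, core 1: t1=1 t0=0, core 0: t1=1 t0=1
 * Assumed layout (illustration only): hardware thread t of core c is
 * at bit position c*2 + t. */
unsigned smt_mask(void) {
    return 0xCBu;              /* 1100 1011 in binary */
}

/* Test whether hardware thread t of core 'core' may run the process. */
int hw_thread_allowed(unsigned mask, int core, int t) {
    return (mask >> (core * 2 + t)) & 1u;
}
```

Checking it reproduces the slide's two observations: both of core 2's bits are 0 (it can't run the process), and only one of core 1's bits is 1 (one simultaneous thread).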
Default Affinities
• The default affinity mask is all 1s: all threads can run on all processors

• Then the OS scheduler decides which threads run on which core

• The OS scheduler detects skewed workloads and migrates threads to less busy processors
Process migration is costly
• Need to restart the execution pipeline

• Cached data is invalidated

• The OS scheduler tries to avoid migration as much as possible: it tends to keep a thread on the same core

• This is called soft affinity
Hard Affinities
• The programmer can prescribe their own affinities (hard affinities)

• Rule of thumb: use the default scheduler unless there is a good reason not to
When to set your own affinities
• Two (or more) threads share data structures in memory
– map them to the same core so that they can share the cache
• Real-time threads. Example: a thread running a robot controller:
– must not be context switched, or else the robot can become unstable
– dedicate an entire core just to this thread
Kernel scheduler API

#include <sched.h>
#include <bits/stdc++.h>
#include <chrono>
#include <pthread.h>
#include <unistd.h>

using namespace std::chrono;

// number of processors currently online
const long NPROCESSORS = sysconf(_SC_NPROCESSORS_ONLN);

cpu_set_t cores;   // the kernel's CPU-set (affinity mask) type
Kernel scheduler API (continued)

auto start_serial_time = high_resolution_clock::now();
// ... 2-D matrix multiplication (the timed work) ...
auto end_serial_time = high_resolution_clock::now();
Kernel scheduler API (continued)

CPU_ZERO(&cores);     // clear the set
CPU_SET(i, &cores);   // allow core i
Kernel scheduler API (continued)

CPU_ZERO(&cores);
CPU_SET(i, &cores);
pthread_attr_t attr;
pthread_attr_init(&attr);   // the attribute object must be initialized first
pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cores);
pthread_create(&th1, &attr, ChildThread, NULL);   // th1 is a pthread_t
Kernel scheduler API

#include <sched.h>
int sched_getaffinity(pid_t pid,
                      unsigned int len, unsigned long *mask);

Retrieves the current affinity mask of process 'pid' and stores it into the space pointed to by 'mask'.
'len' is the size of the mask in bytes: sizeof(unsigned long).
(Modern glibc declares this call with size_t and cpu_set_t * parameters.)
Kernel scheduler API

#include <sched.h>
int sched_setaffinity(pid_t pid,
                      unsigned int len, unsigned long *mask);

Sets the current affinity mask of process 'pid' to *mask.
'len' is the size of the mask in bytes: sizeof(unsigned long).

To query the affinity of a running process:

[~]$ taskset -p 3935
pid 3935's current affinity mask: f

The mask f is binary 1111: the process may run on cores 0, 1, 2, and 3.
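As a sanity check, these calls can be exercised from a small Linux-only program. This is a sketch using the modern glibc prototypes (cpu_set_t rather than the raw unsigned long shown above); pid 0 means "the calling process".

```c
#define _GNU_SOURCE            /* for CPU_SET, CPU_COUNT, sched_*affinity */
#include <sched.h>

/* Restrict the calling process to core 0, then read the mask back.
 * Returns 1 if the mask now contains exactly core 0, -1 on failure. */
int restrict_to_core0(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                          /* mask = ...0001 */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;                             /* e.g., not on Linux */
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_ISSET(0, &set) && CPU_COUNT(&set) == 1;
}
```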
Flynn’s Taxonomy
• Michael Flynn (from Stanford)
– Made a characterization of computer systems, by their instruction and data streams, which became known as Flynn’s Taxonomy

[Diagram: a computer is characterized by its instruction streams and its data streams]

Multiple Processor Organization
Flynn’s Taxonomy:
1. Single Instruction, Single Data stream – SISD
2. Single Instruction, Multiple Data stream – SIMD
3. Multiple Instruction, Single Data stream – MISD
4. Multiple Instruction, Multiple Data stream – MIMD
1. Single Instruction, Single Data Stream – SISD
• Single processor
• Single instruction stream
• Data stored in a single memory
• Example: uni-processor systems

[Diagram: one instruction stream feeding a single processing unit that operates on a single data stream]
2. Single Instruction, Multiple Data Stream – SIMD
• A single machine instruction controls simultaneous execution
• Large number of processing elements
• Each processing element has associated memory
• Each instruction is executed on a different set of data by different processors
• Examples: GPUs

[Diagram: one instruction stream broadcast to multiple processing units, each operating on its own data stream]
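The SIMD idea — one operation applied across many data elements — can be mimicked in plain C with an element-wise loop; modern compilers typically auto-vectorize such loops into real SIMD instructions. This is an illustrative sketch (vec_add is not a standard function):

```c
#include <stddef.h>

/* One operation (add) applied to many data elements: each iteration is
 * independent, so the loop maps naturally onto SIMD lanes. */
void vec_add(const int *a, const int *b, int *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```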
3. Multiple Instruction, Single Data Stream – MISD
• A sequence of data is transmitted to a set of processors
• Each processor executes a different instruction sequence on the same data
• Few examples exist: systolic array processors

[Diagram: multiple instruction streams, each driving its own processing unit, all operating on the same data stream]
4. Multiple Instruction, Multiple Data Stream – MIMD
• A set of processors simultaneously execute different instruction sequences
• On different sets of data
• Examples: multi-cores, SMPs, clusters

[Diagram: multiple instruction streams, each driving its own processing unit on its own data stream]
MIMD - Overview
• General purpose processors
• Each can process all instructions necessary
• Further classified by method of processor
communication:
1. Via Shared Memory
2. Message Passing (Distributed Memory)
Taxonomy of Processor Architectures
Tightly Coupled - SMP
• Processors share memory
• Communicate via that shared memory

• Symmetric Multiprocessor (SMP)


– Single shared memory
– Shared bus to access memory
– Memory access time to given area of memory is
approximately the same for each processor
Symmetric Multiprocessors
(SMPs)
– Two or more similar processors
– Processors share same memory and I/O
– Processors are connected by a bus or other internal
connection
– Memory access time is approximately the same for
each processor
– All processors share access to I/O
SMP Advantages
• Performance
– If some work can be done in parallel
• Availability
– Failure of a single processor does not halt the system
• Incremental growth
– User can enhance performance by adding additional
processors
• Scaling
– Vendors can offer range of products based on number
of processors
Block Diagram of Tightly Coupled Multiprocessor
(SMP)
Symmetric Multiprocessor Organization
Multithreading and Chip Multiprocessors
• Instruction stream divided into smaller streams called “threads”
• Executed in parallel

Definitions of Threads and Processes
• Process:
– An instance of a program running on a computer
– A unit of resource ownership: a virtual address space holds the process image
– Process switch: switching the processor between processes
• Thread: a dispatchable unit of work within a process
– Includes processor context (which includes the PC register and stack pointer) and a data area for its stack
– Interruptible: the processor can turn to another thread
• Thread switch
– Switching the processor between threads within the same process
– Typically less costly than a process switch
Implicit and Explicit Multithreading
• All commercial processors use explicit multithreading:
– Concurrently execute instructions from different explicit threads
– Interleave instructions from different threads on shared pipelines, OR execute them in parallel on parallel pipelines

• Implicit multithreading: concurrent execution of multiple threads extracted from a single sequential program:
– Implicit threads are defined statically by the compiler or dynamically by hardware
– Examples: Intel C++ Compiler, YUCCA, Par4All, etc.