Lecture 16:
Implementing
Synchronization
Parallel Computer Architecture and Programming
CMU 15-418/15-618, Fall 2023
Today’s topic: efficiently implementing
synchronization primitives
▪ Primitives for ensuring mutual exclusion
- Locks
- Atomic primitives (e.g., atomic_add)
- Transactions
▪ Primitives for event signaling
- Barriers
- Flags
Three phases of a synchronization event
1. Acquire method
- How a thread attempts to gain access to protected resource
2. Waiting algorithm
- How a thread waits for access to be granted to shared resource
3. Release method
- How a thread enables other threads to gain access to the resource when its
work in the synchronized region is complete
Busy waiting
▪ Busy waiting (a.k.a. “spinning”)
while (condition X not true) {}
logic that assumes X is true
▪ In classes like 15-213 or an operating systems course, you have
certainly discussed synchronization before
- You might have been taught busy-waiting is bad: why?
“Blocking” synchronization
▪ Idea: if progress cannot be made because a resource cannot
be acquired, it is desirable to free up execution resources for
another thread (preempt the running thread)
if (condition X not true)
block until true; // OS scheduler de-schedules thread
// (lets another thread use the processor)
▪ pthreads mutex example
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_lock(&mutex);
// ... critical section ...
pthread_mutex_unlock(&mutex);
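A minimal runnable sketch (not from the slides): two threads increment a
shared counter under the mutex; a thread that finds the mutex held blocks
and is de-scheduled by the OS.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

static void* worker(void* arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&mutex);     // may block (thread is de-scheduled)
        counter++;
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  // always 2000000
    return 0;
}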
Busy waiting vs. blocking
▪ Busy-waiting can be preferable to blocking if:
- Scheduling overhead is larger than expected wait time
- Tail latency effects
- Processor’s resources not needed for other tasks
- This is often the case in a parallel program since we usually don’t oversubscribe
a system when running a performance-critical parallel app (e.g., there aren’t
multiple CPU-intensive programs running at the same time)
- Clarification: be careful not to confuse the above statement with the value of
multi-threading (interleaving execution of multiple threads/tasks to hide the
long latency of memory operations with other work within the same app)
▪ Examples:
pthread_spinlock_t spin;
pthread_spin_lock(&spin);

int lock;
OSSpinLockLock(&lock);   // OS X spin lock
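A minimal sketch of the pthreads spin lock API (unlike the static mutex
initializer above, a spin lock must be initialized with pthread_spin_init):

#include <pthread.h>

pthread_spinlock_t spin;

void example(void) {
    pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
    pthread_spin_lock(&spin);     // busy-waits (burns cycles) if contended
    /* ... short critical section ... */
    pthread_spin_unlock(&spin);
    pthread_spin_destroy(&spin);
}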
Implementing Locks
Warm up: a simple, but incorrect, lock
lock:    ld  R0, mem[addr]   // load word into R0
         cmp R0, #0          // compare R0 to 0
         bnz lock            // if nonzero, jump to top (lock is held)
         st  mem[addr], #1   // store 1 to address (acquire lock)

unlock:  st  mem[addr], #0   // store 0 to address (release lock)
Problem: data race because LOAD-TEST-STORE is not atomic!
Processor 0 loads address X, observes 0
Processor 1 loads address X, observes 0
Processor 0 writes 1 to address X
Processor 1 writes 1 to address X
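The same bug in C (a sketch, not from the slides): the load/test and the
store are separate memory operations, so the interleaving above is possible.

void buggy_lock(volatile int* lock) {
    while (*lock != 0);   // LOAD + TEST: may observe 0...
    *lock = 1;            // STORE: ...but another thread may store 1 first
}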
Test-and-set based lock
Atomic test-and-set instruction:
ts R0, mem[addr] // load mem[addr] into R0
// if mem[addr] is 0, set mem[addr] to 1
lock:    ts  R0, mem[addr]   // load word into R0 (and set mem[addr] if it was 0)
         bnz R0, lock        // if R0 is nonzero, lock was held: retry
                             // (if R0 is 0, lock obtained)

unlock:  st  mem[addr], #0   // store 0 to address (release lock)
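A test-and-set lock built on C11 atomics (a sketch; atomic_flag_test_and_set
is the standard library’s test-and-set primitive):

#include <stdatomic.h>

atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void ts_lock(atomic_flag* f) {
    while (atomic_flag_test_and_set(f));   // spin until it returns “was clear”
}

void ts_unlock(atomic_flag* f) {
    atomic_flag_clear(f);                  // store 0: release the lock
}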
Test-and-set lock: consider coherence traffic
Timeline (P0, P1, P2 contend for the lock):

1. P0 executes T&S: BusRdX invalidates the line in P1’s and P2’s caches;
   P0 updates the line in its cache (sets lock to 1). P0 now holds the lock.
2. While P0 holds the lock, P1 and P2 repeatedly retry T&S. Every attempt
   issues a BusRdX, invalidates the line in the other caches, and fails to
   update (the lock value is already 1).
3. P0 releases: BusRdX invalidates the line in the other caches; P0 updates
   the line in its cache (sets lock to 0).
4. P1’s next T&S wins: BusRdX invalidates the line in the other caches; P1
   updates the line (sets lock to 1). P1 now holds the lock.
Test-and-set lock performance
Benchmark: execute a total of N lock/unlock sequences (in aggregate) by P processors
Critical section time removed so graph plots only time acquiring/releasing the lock
Benchmark executes:
lock(L);
critical-section(c)
unlock(L);
[Figure: benchmark execution time (µs) vs. number of processors]

Bus contention increases amount of time to transfer lock
(lock holder must wait to acquire bus to release)

Not shown: bus contention also slows down execution of critical section

Figure credit: Culler, Singh, and Gupta
Desirable lock performance characteristics
▪ Low latency
- If lock is free and no other processors are trying to acquire it, a processor should
be able to acquire the lock quickly
▪ Low interconnect traffic
- If all processors are trying to acquire lock at once, they should acquire the lock in
succession with as little traffic as possible
▪ Scalability
- Latency / traffic should scale reasonably with number of processors
▪ Low storage cost
▪ Fairness
- Avoid starvation or substantial unfairness
- One ideal: processors should acquire lock in the order they request access to it
Simple test-and-set lock: low latency (under low contention), high traffic,
poor scaling, low storage cost (one int), no provisions for fairness
Test-and-test-and-set lock
void Lock(volatile int* lock) {
    while (1) {
        while (*lock != 0);             // while another processor has the lock...
        if (test_and_set(lock) == 0)    // when lock is released, try to acquire it
            return;
    }
}

void Unlock(volatile int* lock) {
    *lock = 0;
}
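The same pattern with C11 atomics (a sketch): spin with plain loads, which hit
the local cache once the line is fetched, and attempt the atomic exchange only
when the lock looks free.

#include <stdatomic.h>

void tts_lock(atomic_int* lock) {
    while (1) {
        while (atomic_load(lock) != 0);     // read-only spin: no bus traffic
        if (atomic_exchange(lock, 1) == 0)  // test-and-set: returns old value
            return;
    }
}

void tts_unlock(atomic_int* lock) {
    atomic_store(lock, 0);
}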
Test-and-test-and-set lock: coherence traffic
Timeline (P1, P2, P3 contend for the lock):

1. P1 executes T&S: BusRdX invalidates the line in P2’s and P3’s caches;
   P1 updates the line in its cache (sets lock to 1). P1 now holds the lock.
2. P2 and P3 each issue one BusRd to fetch the line, then spin with many
   reads from their local caches: no interconnect traffic while P1 holds
   the lock.
3. P1 releases: BusRdX invalidates the line in P2’s and P3’s caches;
   P1 updates the line in its cache (sets lock to 0).
4. P2 and P3 each issue a BusRd, observe the lock is free, and attempt T&S.
   P2’s BusRdX wins and updates the line (sets lock to 1): P2 now holds the
   lock. P3’s BusRdX invalidates the line, but its attempt to update fails.
Test-and-test-and-set characteristics
▪ Slightly higher latency than test-and-set in uncontended case
- Must test... then test-and-set
▪ Generates much less interconnect traffic
- One invalidation, per waiting processor, per lock release (O(P) invalidations)
- This is O(P²) interconnect traffic if all processors have the lock cached
- Recall: test-and-set lock generated one invalidation per waiting processor per test
▪ More scalable (due to less traffic)
▪ Storage cost unchanged (one int)
▪ Still no provisions for fairness
Test-and-set lock with back off
Upon failure to acquire lock, delay for a while before retrying

void Lock(volatile int* l) {
    int amount = 1;
    while (1) {
        if (test_and_set(l) == 0)
            return;
        delay(amount);
        amount *= 2;    // exponential back-off
    }
}
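A runnable C11 rendering of the back-off lock (a sketch; the delay loop is a
crude stand-in for a real pause or sleep primitive):

#include <stdatomic.h>

static void delay(int cycles) {
    for (volatile int i = 0; i < cycles; i++);   // burn time, no bus traffic
}

void backoff_lock(atomic_int* l) {
    int amount = 1;
    while (1) {
        if (atomic_exchange(l, 1) == 0)   // test-and-set
            return;
        delay(amount);
        amount *= 2;                      // exponential back-off
    }
}

void backoff_unlock(atomic_int* l) {
    atomic_store(l, 0);
}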
▪ Same uncontended latency as test-and-set, but potentially higher latency under
contention. Why?
▪ Generates less traffic than test-and-set (not continually attempting to acquire lock)
▪ Improves scalability (due to less traffic)
▪ Storage cost unchanged (still one int for lock)
▪ Exponential back-off can cause severe unfairness
- Newer requesters back off for shorter intervals
Ticket lock
Main problem with test-and-set style locks: upon release,
all waiting processors attempt to acquire lock using test-
and-set
struct lock {
volatile int next_ticket;
volatile int now_serving;
};
void Lock(lock* l) {
    int my_ticket = atomic_increment(&l->next_ticket);  // take a “ticket”
    while (my_ticket != l->now_serving);                // wait for number to be called
}

void Unlock(lock* l) {
    l->now_serving++;
}
No atomic operation needed while waiting for the lock (spinning requires only reads;
the single atomic operation happens once, when taking a ticket)
Result: only one invalidation per lock release (O(P) interconnect traffic)
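The ticket lock with C11 atomics (a sketch): one fetch-and-add to take a
ticket, then a read-only spin until the ticket is called.

#include <stdatomic.h>

typedef struct {
    atomic_int next_ticket;   // initialize to 0
    atomic_int now_serving;   // initialize to 0
} ticket_lock_t;

void ticket_lock(ticket_lock_t* l) {
    int my_ticket = atomic_fetch_add(&l->next_ticket, 1);  // take a “ticket”
    while (atomic_load(&l->now_serving) != my_ticket);     // wait to be called
}

void ticket_unlock(ticket_lock_t* l) {
    atomic_fetch_add(&l->now_serving, 1);   // call the next ticket
}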
Array-based lock
Each processor spins on a different memory address
Utilizes atomic operation to assign address on attempt to acquire
struct lock {
volatile padded_int status[P]; // padded so each element sits on its own cache line
volatile int head;
};
int my_element;

void Lock(lock* l) {
    my_element = atomic_circ_increment(&l->head);   // assume circular increment
    while (l->status[my_element] == 1);
}

void Unlock(lock* l) {
    l->status[my_element] = 1;
    l->status[circ_next(my_element)] = 0;   // circ_next() gives next index
}
O(1) interconnect traffic per release, but lock requires space linear in P
Also, the atomic circular increment is a more complex operation (higher overhead)
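One way to define padded_int (a sketch, assuming 64-byte cache lines): the
alignment forces each status word onto its own line, so each waiter’s spin
loop touches only its own cache line.

#include <stdalign.h>

typedef struct {
    alignas(64) volatile int v;   // struct is padded out to 64 bytes
} padded_int;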
x86 cmpxchg
▪ Compare and exchange (atomic when used with lock prefix)
lock cmpxchg dst, src

- dst: often a memory address
- src: the value to store
- the lock prefix makes the operation atomic
- implicit third operand: the x86 accumulator register (e.g., eax)

Semantics:

if (dst == accumulator) {   // 1. Does dst have the value we think it has?
    ZF = 1                  //    (ZF is a bit in the flags register)
    dst = src               // 2. Then make the update
} else {
    ZF = 0
    accumulator = dst       // 3. If not, return the current value
}
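The same compare-and-exchange semantics are available in C11 (a sketch): here
they implement a simple lock where 0 = free and 1 = held. On failure,
atomic_compare_exchange_strong writes the observed value back into expected,
just as cmpxchg writes dst into the accumulator.

#include <stdatomic.h>

void cas_lock(atomic_int* lock) {
    int expected = 0;
    while (!atomic_compare_exchange_strong(lock, &expected, 1))
        expected = 0;   // CAS failed: reset and retry
}

void cas_unlock(atomic_int* lock) {
    atomic_store(lock, 0);
}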
Queue-based Lock (MCS lock)
More details: Figure 5 of “Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors” (Mellor-Crummey and Scott)
▪ Create a queue of waiters
- Each thread allocates a local space on which to wait
▪ Pseudo-code:
- glock – global lock (tail of queue)
- mlock – my lock (state, next pointer)
AcquireQLock(*glock, *mlock)
{
    mlock->next = NULL;
    mlock->state = UNLOCKED;
    ATOMIC();                          // atomic swap: enqueue at tail
    prev = *glock;
    *glock = mlock;
    END_ATOMIC();
    if (prev == NULL)                  // queue was empty: lock acquired
        return;
    mlock->state = LOCKED;
    prev->next = mlock;                // link behind predecessor
    while (mlock->state == LOCKED);    // SPIN on local flag
}

ReleaseQLock(*glock, *mlock)
{
    do {
        if (mlock->next == NULL) {
            x = CMPXCHG(glock, mlock, NULL);   // **
            if (x == mlock) return;            // no waiter: lock released
        }
        else {
            mlock->next->state = UNLOCKED;     // hand lock to next waiter
            return;
        }
    } while (1);
}

**Note the semantics of cmpxchg from the previous slide
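The queue lock rendered with C11 atomics (a sketch; the field and function
names are illustrative, not from the paper):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    _Atomic(struct qnode*) next;
    atomic_bool locked;
} qnode;

typedef _Atomic(qnode*) mcs_lock;   // tail of the waiter queue

void mcs_acquire(mcs_lock* glock, qnode* mlock) {
    atomic_store(&mlock->next, NULL);
    qnode* prev = atomic_exchange(glock, mlock);  // atomic swap of tail
    if (prev == NULL) return;                     // queue was empty
    atomic_store(&mlock->locked, true);
    atomic_store(&prev->next, mlock);             // link behind predecessor
    while (atomic_load(&mlock->locked));          // spin on local flag
}

void mcs_release(mcs_lock* glock, qnode* mlock) {
    qnode* succ = atomic_load(&mlock->next);
    if (succ == NULL) {
        qnode* expected = mlock;
        if (atomic_compare_exchange_strong(glock, &expected, NULL))
            return;                               // no waiter: lock is free
        while ((succ = atomic_load(&mlock->next)) == NULL); // waiter mid-enqueue
    }
    atomic_store(&succ->locked, false);           // hand off to successor
}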
Quiz Time
Implementing Barriers
Implementing a centralized barrier
(Based on shared counter)
struct Barrier_t {
LOCK lock;
int counter; // initialize to 0
int flag; // the flag field should probably be padded to
// sit on its own cache line. Why?
};
// barrier for p processors
void Barrier(Barrier_t* b, int p) {
    lock(b->lock);
    if (b->counter == 0) {
        b->flag = 0;              // first thread arriving at barrier clears flag
    }
    int num_arrived = ++(b->counter);
    unlock(b->lock);

    if (num_arrived == p) {       // last arriver sets flag
        b->counter = 0;
        b->flag = 1;
    }
    else {
        while (b->flag == 0);     // wait for flag
    }
}

Does it work? Consider:

    do stuff ...
    Barrier(b, P);
    do more stuff ...
    Barrier(b, P);
Correct centralized barrier
struct Barrier_t {
LOCK lock;
int arrive_counter; // initialize to 0 (number of threads that have arrived)
int leave_counter; // initialize to P (number of threads that have left barrier)
int flag;
};
Main idea: wait for all threads to leave the first barrier before clearing the
flag for entry into the second.

// barrier for p processors
void Barrier(Barrier_t* b, int p) {
    lock(b->lock);
    if (b->arrive_counter == 0) {           // if first to arrive...
        if (b->leave_counter == p) {        // check that no other threads are “still in barrier”
            b->flag = 0;                    // first arriving thread clears flag
        } else {
            unlock(b->lock);
            while (b->leave_counter != p);  // wait for all threads to leave before clearing
            lock(b->lock);
            b->flag = 0;                    // first arriving thread clears flag
        }
    }
    int num_arrived = ++(b->arrive_counter);
    unlock(b->lock);

    if (num_arrived == p) {                 // last arriver sets flag
        b->arrive_counter = 0;
        b->leave_counter = 1;               // the last arriver counts itself as having left
        b->flag = 1;
    }
    else {
        while (b->flag == 0);               // wait for flag
        lock(b->lock);
        b->leave_counter++;
        unlock(b->lock);
    }
}
Centralized barrier with sense reversal
struct Barrier_t {
LOCK lock;
int counter; // initialize to 0
int flag; // initialize to 0
};
int local_sense = 0;   // private per processor

Main idea: processors wait for flag to become equal to their local sense
// barrier for p processors
void Barrier(Barrier_t* b, int p) {
    local_sense = (local_sense == 0) ? 1 : 0;
    lock(b->lock);
    int num_arrived = ++(b->counter);
    if (num_arrived == p) {              // last arriver sets flag
        unlock(b->lock);
        b->counter = 0;
        b->flag = local_sense;
    }
    else {
        unlock(b->lock);
        while (b->flag != local_sense);  // wait for flag
    }
}

Sense reversal optimization results in one spin instead of two
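The same barrier with C11 atomics (a sketch, not from the slides): an atomic
counter replaces the lock, and _Thread_local gives each thread its private
sense.

#include <stdatomic.h>

typedef struct {
    atomic_int counter;   // initialize to 0
    atomic_int flag;      // initialize to 0
} barrier_t;

static _Thread_local int local_sense = 0;   // private per thread

void barrier(barrier_t* b, int p) {
    local_sense = 1 - local_sense;
    if (atomic_fetch_add(&b->counter, 1) == p - 1) {   // last arriver
        atomic_store(&b->counter, 0);
        atomic_store(&b->flag, local_sense);           // release the waiters
    } else {
        while (atomic_load(&b->flag) != local_sense);  // wait for flag
    }
}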
Centralized barrier: traffic
▪ O(P) traffic on interconnect per barrier:
- All threads: 2P write transactions to obtain barrier lock and update counter
(O(P) traffic assuming lock acquisition is implemented in O(1) manner)
- Last thread: 2 write transactions to write to the flag and reset the counter
(O(P) traffic since there are many sharers of the flag)
- P-1 transactions to read updated flag
▪ But there is still serialization on a single shared lock
- So span (latency) of entire operation is O(P)
- Can we do better?
Combining tree implementation of barrier
[Figure: centralized barrier (high contention on the single barrier lock and
counter) vs. combining tree barrier]
▪ Combining trees make better use of parallelism in interconnect topologies
- lg(P) span (latency)
- Strategy makes less sense on a bus (all traffic still serialized on single shared bus)
▪ Barrier acquire: when processor arrives at barrier, performs increment of parent counter
- Process recurses to root
▪ Barrier release: beginning from root, notify children of release (see the sketch below)
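A sketch of a combining tree barrier with fan-in K, using sense reversal
(the names and structure here are illustrative, not from the slides): each
group of K threads shares a leaf node; the last arriver at a node recurses
toward the root, and the release propagates back down.

#include <stdatomic.h>
#include <stddef.h>

#define K 4   // fan-in of the tree

typedef struct Node {
    struct Node* parent;   // NULL at the root
    atomic_int count;      // arrivals at this node (initialize to 0)
    atomic_int sense;      // release flag for this node (initialize to 0)
} Node;

void tree_barrier(Node* n, int local_sense) {
    if (atomic_fetch_add(&n->count, 1) == K - 1) {
        // last arriver at this node: continue the barrier one level up
        if (n->parent != NULL)
            tree_barrier(n->parent, local_sense);
        atomic_store(&n->count, 0);             // reset for the next barrier
        atomic_store(&n->sense, local_sense);   // release waiters at this node
    } else {
        while (atomic_load(&n->sense) != local_sense);  // spin locally
    }
}

As in the sense-reversing barrier, each thread flips its local_sense between
consecutive barriers.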