Efficient GPU Synchronization without Scopes:
Saying No to Complex Consistency Models
Matthew D. Sinclair, Johnathan Alsop, Sarita V. Adve
University of Illinois @ Urbana-Champaign
hetero@cs.illinois.edu
Motivation
Heterogeneous systems now used for a wide variety of applications
Emerging applications have fine-grained synchronization
BUT current GPUs have sub-optimal consistency and coherence
Consistency Coherence
Defacto Data-race-free (DRF) High overhead on synchs
Simple Inefficient
Recent Heterogeneous-race-free (HRF) No overhead for local synchs
Scoped synchronization
Complex Efficient for local synch
This work: simple consistency + efficient coherence
2
Motivation (Cont.)
Do GPU models (HRF) need to be more complex than CPU models (DRF)?
NO! Not if coherence is done right!
DeNovo+DRF: Efficient AND simpler memory model
– Comparable or better results vs. GPU+DRF and GPU+HRF
complex
consistency
models 3
Outline
• Motivation
• Coherence Protocols and Consistency Models
– Classification
– GPU Coherence
– DeNovo Coherence
– Coherence and Consistency Summary
• Results
• Conclusion
4
A Classification of Coherence Protocols
• Read hit: Don’t return stale data
• Read miss: Find one up-to-date copy
Invalidator
Writer Reader
Track Ownership
MESI DeNovo
up-to-
date
copy Writethrough GPU
• Reader-initiated invalidations
– No invalidation or ack traffic, directories, transient states
• Obtaining ownership for written data
– Reuse owned data across synchs (not flushed at synch points)
5
GPU Coherence with DRF
GPU CPU
Flush dirty Valid
Invalidate DirtyCache Cache
alldata
data Valid
L2 Cache L2 Cache
Bank Bank
Interconnection n/w
• With data-race-free (DRF) memory model
– No data races; synchs must be explicitly distinguished
– At all synch points
• Flush all dirty data: Unnecessary writethroughs
• Invalidate all data: Can’t reuse data across synch points
– Synchronization accesses must go to last level cache (LLC)
6
GPU Coherence with HRF
heterogeneous HRF
• With data-race-free (DRF) memory model [ASPLOS ’14]
– No data races; synchs must be explicitly distinguished
heterogeneous and their scopes
– At all synch points
global
• Flush all dirty data: Unnecessary writethroughs
• Invalidate all data: Can’t reuse data across synch points
Global
– Synchronization accesses must go to last level cache (LLC)
– No overhead for locally scoped synchs
• But higher programming complexity
7
DeNovo Coherence with DRF
GPU CPU
Invalidate
Obtain Own
DirtyCache Cache
non-owned data Valid
ownership
L2 Cache L2 Cache
Bank Bank
Interconnection n/w
• With data-race-free (DRF) memory model
– No data races; synchs must be explicitly distinguished
– At all synch points
• Flush all dirty data Obtain ownership for dirty data Can reuse
• Invalidate all non-owned data owned data
can be performed at L1
– Synchronization accesses must go to last level cache (LLC)
• 3% state overhead vs. GPU coherence + HRF 8
DeNovo Configurations Studied
• DeNovo+DRF:
– Invalidate all non-owned data at synch points
• DeNovo-RO+DRF:
– Avoids invalidating read-only data at synch points
• DeNovo+HRF:
– Reuse valid data if synch is locally scoped
9
Coherence & Consistency Summary
Coherence + Consistency Reuse Data Do Synchs
Owned Valid at L1
GPU + DRF (GD) X X X
GPU + HRF (GH) local local local
DeNovo + DRF (DD) X
DeNovo-RO + DRF (DD+RO) read-only
DeNovo + HRF (DH) local
10
Outline
• Motivation
• Coherence Protocols and Consistency Models
• Results
• Conclusion
11
Evaluation Methodology
• 1 CPU core + 15 GPU compute units (CU)
– Each node has private L1, scratchpad, tile of shared L2
• Simulation Environment
– GEMS, Simics, Garnet, GPGPU-Sim, GPUWattch, McPAT
• Workloads
– 10 apps from Rodinia, Parboil: no fine-grained synch
• DeNovo and GPU coherence perform comparably
– UC-Davis microbenchmarks + UTS from HRF paper:
• Mutex, semaphore, barrier, work sharing
• Shows potential for future apps
• Created two versions of each: globally, locally/hybrid scoped synch
12
Global Synch – Execution Time
FAM SLM SPM SPMBO AVG
100%
80%
60%
40%
20%
0%
G* D* G* D* G* D* G* D* G* D*
DeNovo has 28% lower execution time than GPU with global synch
13
Global Synch – Energy
N/W L2 $ L1 D$ Scratch GPU Core+
FAM SLM SPM SPMBO AVG
100%
80%
60%
40%
20%
0%
G* D* G* D* G* D* G* D* G* D*
DeNovo has 51% lower energy than GPU with global synch
14
Local Synch – Execution Time
GD GH DD DD+RO DH
FAM SLM SPM SPMBO SS SSBO TBEX TB UTS AVG
100%
80%
60%
40%
20%
0%
DH
DH
DD
DH
DD
DD
DH
DD
DH
DD
DH
DD
DH
DD
DH
DD
DD
DH
DD
DH
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
GH
GD
GH
GD
GH
GD
GH
GD
GH
GD
GH
GD
GD
GH
GD
GH
GD
GH
GD
GH
GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]
15
Local Synch – Execution Time
GD GH DD DD+RO DH
FAM SLM SPM SPMBO SS SSBO TBEX TB UTS AVG
100%
80%
60%
40%
20%
0%
DH
DH
DD
DH
DD
DD
DH
DD
DH
DD
DH
DD
DH
DD
DH
DD
DD
DH
DD
DH
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
GH
GD
GH
GD
GH
GD
GH
GD
GH
GD
GH
GD
GD
GH
GD
GH
GD
GH
GD
GH
GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]
DeNovo+DRF comparable to GPU+HRF, but simpler consistency model
16
Local Synch – Execution Time
GD GH DD DD+RO DH
FAM SLM SPM SPMBO SS SSBO TBEX TB UTS AVG
100%
80%
60%
40%
20%
0%
DH
DH
DD
DH
DD
DD
DH
DD
DH
DD
DH
DD
DH
DD
DH
DD
DD
DH
DD
DH
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
GH
GD
GH
GD
GH
GD
GH
GD
GH
GD
GH
GD
GD
GH
GD
GH
GD
GH
GD
GH
GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]
DeNovo+DRF comparable to GPU+HRF, but simpler consistency model
DeNovo-RO+DRF reduces gap by not invalidating read-only data
17
Local Synch – Execution Time
GD GH DD DD+RO DH
FAM SLM SPM SPMBO SS SSBO TBEX TB UTS AVG
100%
80%
60%
40%
20%
0%
DH
DH
DD
DH
DD
DD
DH
DD
DH
DD
DH
DD
DH
DD
DH
DD
DD
DH
DD
DH
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
GH
GD
GH
GD
GH
GD
GH
GD
GH
GD
GH
GD
GD
GH
GD
GH
GD
GH
GD
GH
GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]
DeNovo+DRF comparable to GPU+HRF, but simpler consistency model
DeNovo-RO+DRF reduces gap by not invalidating read-only data
DeNovo+HRF is best, if consistency complexity acceptable 18
Local Synch – Energy
N/W L2 $ L1 D$ Scratch GPU Core+
FAM SLM SPM SPMBO SS SSBO TBEX TB UTS AVG
100%
80%
60%
40%
20%
0%
DD
DD
DD
DD
DD
DD
DD
DD
DH
DD
DD
DH
DH
DH
DH
DH
DH
DH
DH
DH
DD+RO
DD+RO
DD+RO
GH
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
DD+RO
GD
GH
GD
GH
GD
GH
GD
GD
GH
GD
GH
GD
GH
GD
GH
GD
GH
GD
GH
Energy trends similar to execution time
19
Conclusions
• Emerging heterogeneous apps use fine-grained synch
– GPU coherence + DRF: inefficient, but simple memory model
– GPU coherence + HRF: efficient, but complex memory model
Do GPU models (HRF) need to be more complex than CPU models (DRF)?
– DeNovo + DRF: efficient AND simple memory model
complex
consistency
models!
20