0% found this document useful (0 votes)
22 views20 pages

Efficient GPU Synchronization Without Scopes: Saying No To Complex Consistency Models

The document discusses the need for efficient GPU synchronization models, arguing against the complexity of current heterogeneous-race-free (HRF) models in favor of a simpler data-race-free (DRF) model with DeNovo coherence. It presents the advantages of using DeNovo+DRF, which offers comparable or better performance with lower complexity compared to existing GPU models. The findings suggest that simpler consistency models can achieve efficient coherence without the overhead associated with more complex models.

Uploaded by

Alnoamie Hadroos
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
22 views20 pages

Efficient GPU Synchronization Without Scopes: Saying No To Complex Consistency Models

The document discusses the need for efficient GPU synchronization models, arguing against the complexity of current heterogeneous-race-free (HRF) models in favor of a simpler data-race-free (DRF) model with DeNovo coherence. It presents the advantages of using DeNovo+DRF, which offers comparable or better performance with lower complexity compared to existing GPU models. The findings suggest that simpler consistency models can achieve efficient coherence without the overhead associated with more complex models.

Uploaded by

Alnoamie Hadroos
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Efficient GPU Synchronization without Scopes:

Saying No to Complex Consistency Models

Matthew D. Sinclair, Johnathan Alsop, Sarita V. Adve


University of Illinois @ Urbana-Champaign
hetero@cs.illinois.edu
Motivation

Heterogeneous systems now used for a wide variety of applications


Emerging applications have fine-grained synchronization

BUT current GPUs have sub-optimal consistency and coherence

Consistency Coherence
Defacto Data-race-free (DRF) High overhead on synchs
Simple Inefficient
Recent Heterogeneous-race-free (HRF) No overhead for local synchs
Scoped synchronization
Complex Efficient for local synch

This work: simple consistency + efficient coherence


2
Motivation (Cont.)

Do GPU models (HRF) need to be more complex than CPU models (DRF)?
NO! Not if coherence is done right!
DeNovo+DRF: Efficient AND simpler memory model
– Comparable or better results vs. GPU+DRF and GPU+HRF

complex
consistency
models 3
Outline

• Motivation
• Coherence Protocols and Consistency Models
– Classification
– GPU Coherence
– DeNovo Coherence
– Coherence and Consistency Summary
• Results
• Conclusion

4
A Classification of Coherence Protocols

• Read hit: Don’t return stale data


• Read miss: Find one up-to-date copy
Invalidator
Writer Reader
Track Ownership
MESI DeNovo
up-to-
date
copy Writethrough GPU

• Reader-initiated invalidations
– No invalidation or ack traffic, directories, transient states
• Obtaining ownership for written data
– Reuse owned data across synchs (not flushed at synch points)
5
GPU Coherence with DRF
GPU CPU
Flush dirty Valid
Invalidate DirtyCache Cache
alldata
data Valid

L2 Cache L2 Cache
Bank Bank

Interconnection n/w

• With data-race-free (DRF) memory model


– No data races; synchs must be explicitly distinguished
– At all synch points
• Flush all dirty data: Unnecessary writethroughs
• Invalidate all data: Can’t reuse data across synch points
– Synchronization accesses must go to last level cache (LLC)
6
GPU Coherence with HRF

heterogeneous HRF
• With data-race-free (DRF) memory model [ASPLOS ’14]
– No data races; synchs must be explicitly distinguished
heterogeneous and their scopes
– At all synch points
global
• Flush all dirty data: Unnecessary writethroughs
• Invalidate all data: Can’t reuse data across synch points
Global
– Synchronization accesses must go to last level cache (LLC)
– No overhead for locally scoped synchs

• But higher programming complexity

7
DeNovo Coherence with DRF
GPU CPU
Invalidate
Obtain Own
DirtyCache Cache
non-owned data Valid
ownership
L2 Cache L2 Cache
Bank Bank

Interconnection n/w

• With data-race-free (DRF) memory model


– No data races; synchs must be explicitly distinguished
– At all synch points
• Flush all dirty data Obtain ownership for dirty data Can reuse
• Invalidate all non-owned data owned data
can be performed at L1
– Synchronization accesses must go to last level cache (LLC)
• 3% state overhead vs. GPU coherence + HRF 8
DeNovo Configurations Studied

• DeNovo+DRF:
– Invalidate all non-owned data at synch points

• DeNovo-RO+DRF:
– Avoids invalidating read-only data at synch points

• DeNovo+HRF:
– Reuse valid data if synch is locally scoped

9
Coherence & Consistency Summary

Coherence + Consistency Reuse Data Do Synchs


Owned Valid at L1
GPU + DRF (GD) X X X
GPU + HRF (GH) local local local
DeNovo + DRF (DD)  X 
DeNovo-RO + DRF (DD+RO)  read-only 

DeNovo + HRF (DH)  local 

10
Outline

• Motivation
• Coherence Protocols and Consistency Models
• Results
• Conclusion

11
Evaluation Methodology

• 1 CPU core + 15 GPU compute units (CU)


– Each node has private L1, scratchpad, tile of shared L2

• Simulation Environment
– GEMS, Simics, Garnet, GPGPU-Sim, GPUWattch, McPAT

• Workloads
– 10 apps from Rodinia, Parboil: no fine-grained synch
• DeNovo and GPU coherence perform comparably
– UC-Davis microbenchmarks + UTS from HRF paper:
• Mutex, semaphore, barrier, work sharing
• Shows potential for future apps
• Created two versions of each: globally, locally/hybrid scoped synch
12
Global Synch – Execution Time

FAM SLM SPM SPMBO AVG


100%

80%

60%

40%

20%

0%
G* D* G* D* G* D* G* D* G* D*

DeNovo has 28% lower execution time than GPU with global synch
13
Global Synch – Energy

N/W L2 $ L1 D$ Scratch GPU Core+


FAM SLM SPM SPMBO AVG
100%

80%

60%

40%

20%

0%
G* D* G* D* G* D* G* D* G* D*

DeNovo has 51% lower energy than GPU with global synch
14
Local Synch – Execution Time

GD GH DD DD+RO DH
FAM SLM SPM SPMBO SS SSBO TBEX TB UTS AVG
100%

80%

60%

40%

20%

0%
DH

DH
DD

DH

DD

DD

DH

DD

DH

DD

DH

DD

DH

DD

DH

DD

DD

DH

DD

DH
DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO
GH
GD
GH

GD
GH

GD
GH

GD
GH

GD
GH

GD

GD
GH

GD
GH

GD
GH

GD
GH
GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]

15
Local Synch – Execution Time

GD GH DD DD+RO DH
FAM SLM SPM SPMBO SS SSBO TBEX TB UTS AVG
100%

80%

60%

40%

20%

0%
DH

DH
DD

DH

DD

DD

DH

DD

DH

DD

DH

DD

DH

DD

DH

DD

DD

DH

DD

DH
DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO
GH
GD
GH

GD
GH

GD
GH

GD
GH

GD
GH

GD

GD
GH

GD
GH

GD
GH

GD
GH
GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]
DeNovo+DRF comparable to GPU+HRF, but simpler consistency model

16
Local Synch – Execution Time

GD GH DD DD+RO DH
FAM SLM SPM SPMBO SS SSBO TBEX TB UTS AVG
100%

80%

60%

40%

20%

0%
DH

DH
DD

DH

DD

DD

DH

DD

DH

DD

DH

DD

DH

DD

DH

DD

DD

DH

DD

DH
DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO
GH
GD
GH

GD
GH

GD
GH

GD
GH

GD
GH

GD

GD
GH

GD
GH

GD
GH

GD
GH
GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]
DeNovo+DRF comparable to GPU+HRF, but simpler consistency model
DeNovo-RO+DRF reduces gap by not invalidating read-only data
17
Local Synch – Execution Time

GD GH DD DD+RO DH
FAM SLM SPM SPMBO SS SSBO TBEX TB UTS AVG
100%

80%

60%

40%

20%

0%
DH

DH
DD

DH

DD

DD

DH

DD

DH

DD

DH

DD

DH

DD

DH

DD

DD

DH

DD

DH
DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO
GH
GD
GH

GD
GH

GD
GH

GD
GH

GD
GH

GD

GD
GH

GD
GH

GD
GH

GD
GH
GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]
DeNovo+DRF comparable to GPU+HRF, but simpler consistency model
DeNovo-RO+DRF reduces gap by not invalidating read-only data
DeNovo+HRF is best, if consistency complexity acceptable 18
Local Synch – Energy

N/W L2 $ L1 D$ Scratch GPU Core+


FAM SLM SPM SPMBO SS SSBO TBEX TB UTS AVG
100%

80%

60%

40%

20%

0%
DD
DD

DD

DD

DD

DD

DD

DD

DH

DD

DD
DH

DH

DH

DH

DH

DH

DH

DH

DH
DD+RO

DD+RO

DD+RO

GH

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO

DD+RO
GD
GH

GD
GH

GD
GH

GD

GD
GH

GD
GH

GD
GH

GD
GH

GD
GH

GD
GH
Energy trends similar to execution time
19
Conclusions

• Emerging heterogeneous apps use fine-grained synch


– GPU coherence + DRF: inefficient, but simple memory model
– GPU coherence + HRF: efficient, but complex memory model
Do GPU models (HRF) need to be more complex than CPU models (DRF)?
– DeNovo + DRF: efficient AND simple memory model

complex
consistency
models!
20

You might also like