0% found this document useful (0 votes)

22 views20 pages

Efficient GPU Synchronization Without Scopes: Saying No To Complex Consistency Models

The document discusses the need for efficient GPU synchronization models, arguing against the complexity of current heterogeneous-race-free (HRF) models in favor of a simpler data-race-free (DRF) model with DeNovo coherence. It presents the advantages of using DeNovo+DRF, which offers comparable or better performance with lower complexity compared to existing GPU models. The findings suggest that simpler consistency models can achieve efficient coherence without the overhead associated with more complex models.

Uploaded by

Alnoamie Hadroos

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

22 views20 pages

Efficient GPU Synchronization Without Scopes: Saying No To Complex Consistency Models

Uploaded by

Alnoamie Hadroos

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Efficient GPU Synchronization without Scopes:

Saying No to Complex Consistency Models

Matthew D. Sinclair, Johnathan Alsop, Sarita V. Adve

University of Illinois @ Urbana-Champaign
hetero@cs.illinois.edu
Motivation

Heterogeneous systems now used for a wide variety of applications

Emerging applications have fine-grained synchronization

BUT current GPUs have sub-optimal consistency and coherence

Consistency Coherence
Defacto Data-race-free (DRF) High overhead on synchs
Simple Inefficient
Recent Heterogeneous-race-free (HRF) No overhead for local synchs
Scoped synchronization
Complex Efficient for local synch

This work: simple consistency + efficient coherence

2
Motivation (Cont.)

Do GPU models (HRF) need to be more complex than CPU models (DRF)?
NO! Not if coherence is done right!
DeNovo+DRF: Efficient AND simpler memory model
– Comparable or better results vs. GPU+DRF and GPU+HRF

complex
consistency
models 3
Outline

• Motivation
• Coherence Protocols and Consistency Models
– Classification
– GPU Coherence
– DeNovo Coherence
– Coherence and Consistency Summary
• Results
• Conclusion

4
A Classification of Coherence Protocols

• Read hit: Don’t return stale data

• Read miss: Find one up-to-date copy
Invalidator
Writer Reader
Track Ownership
MESI DeNovo
up-to-
date
copy Writethrough GPU

• Reader-initiated invalidations
– No invalidation or ack traffic, directories, transient states
• Obtaining ownership for written data
– Reuse owned data across synchs (not flushed at synch points)
5
GPU Coherence with DRF
GPU CPU
Flush dirty Valid
Invalidate DirtyCache Cache
alldata
data Valid

L2 Cache L2 Cache
Bank Bank

Interconnection n/w

• With data-race-free (DRF) memory model

– No data races; synchs must be explicitly distinguished
– At all synch points
• Flush all dirty data: Unnecessary writethroughs
• Invalidate all data: Can’t reuse data across synch points
– Synchronization accesses must go to last level cache (LLC)
6
GPU Coherence with HRF

heterogeneous HRF
• With data-race-free (DRF) memory model [ASPLOS ’14]
– No data races; synchs must be explicitly distinguished
heterogeneous and their scopes
– At all synch points
global
• Flush all dirty data: Unnecessary writethroughs
• Invalidate all data: Can’t reuse data across synch points
Global
– Synchronization accesses must go to last level cache (LLC)
– No overhead for locally scoped synchs

• But higher programming complexity

7
DeNovo Coherence with DRF
GPU CPU
Invalidate
Obtain Own
DirtyCache Cache
non-owned data Valid
ownership
L2 Cache L2 Cache
Bank Bank

Interconnection n/w

• With data-race-free (DRF) memory model

– No data races; synchs must be explicitly distinguished
– At all synch points
• Flush all dirty data Obtain ownership for dirty data Can reuse
• Invalidate all non-owned data owned data
can be performed at L1
– Synchronization accesses must go to last level cache (LLC)
• 3% state overhead vs. GPU coherence + HRF 8
DeNovo Configurations Studied

• DeNovo+DRF:
– Invalidate all non-owned data at synch points

• DeNovo-RO+DRF:
– Avoids invalidating read-only data at synch points

• DeNovo+HRF:
– Reuse valid data if synch is locally scoped

9
Coherence & Consistency Summary

Coherence + Consistency Reuse Data Do Synchs

Owned Valid at L1
GPU + DRF (GD) X X X
GPU + HRF (GH) local local local
DeNovo + DRF (DD)  X 
DeNovo-RO + DRF (DD+RO)  read-only 

DeNovo + HRF (DH)  local 

10
Outline

• Motivation
• Coherence Protocols and Consistency Models
• Results
• Conclusion

11
Evaluation Methodology

• 1 CPU core + 15 GPU compute units (CU)

– Each node has private L1, scratchpad, tile of shared L2

• Simulation Environment
– GEMS, Simics, Garnet, GPGPU-Sim, GPUWattch, McPAT

• Workloads
– 10 apps from Rodinia, Parboil: no fine-grained synch
• DeNovo and GPU coherence perform comparably
– UC-Davis microbenchmarks + UTS from HRF paper:
• Mutex, semaphore, barrier, work sharing
• Shows potential for future apps
• Created two versions of each: globally, locally/hybrid scoped synch
12
Global Synch – Execution Time

FAM SLM SPM SPMBO AVG

100%

80%

60%

40%

20%

0%
G* D* G* D* G* D* G* D* G* D*

DeNovo has 28% lower execution time than GPU with global synch
13
Global Synch – Energy

N/W L2 $ L1 D$ Scratch GPU Core+

FAM SLM SPM SPMBO AVG
100%

80%

60%

40%

20%

0%
G* D* G* D* G* D* G* D* G* D*

DeNovo has 51% lower energy than GPU with global synch
14
Local Synch – Execution Time

GD GH DD DD+RO DH
FAM SLM SPM SPMBO SS SSBO TBEX TB UTS AVG
100%

80%

60%

40%

20%

0%
DH

DH
DD

DH
DD+RO

DD+RO

DD+RO
GH
GD
GH

GD
GH

GD
GH
GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]

15
Local Synch – Execution Time

GD GH DD DD+RO DH
FAM SLM SPM SPMBO SS SSBO TBEX TB UTS AVG
100%

80%

60%

40%

20%

0%
DH

DH
DD

DH
DD+RO

DD+RO

DD+RO
GH
GD
GH

GD
GH

GD
GH
GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]
DeNovo+DRF comparable to GPU+HRF, but simpler consistency model

16
Local Synch – Execution Time

GD GH DD DD+RO DH
FAM SLM SPM SPMBO SS SSBO TBEX TB UTS AVG
100%

80%

60%

40%

20%

0%
DH

DH
DD

DH
DD+RO

DD+RO

DD+RO
GH
GD
GH

GD
GH

GD
GH
GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]
DeNovo+DRF comparable to GPU+HRF, but simpler consistency model
DeNovo-RO+DRF reduces gap by not invalidating read-only data
17
Local Synch – Execution Time

GD GH DD DD+RO DH
FAM SLM SPM SPMBO SS SSBO TBEX TB UTS AVG
100%

80%

60%

40%

20%

0%
DH

DH
DD

DH
DD+RO

DD+RO

DD+RO
GH
GD
GH

GD
GH

N/W L2 $ L1 D$ Scratch GPU Core+

FAM SLM SPM SPMBO SS SSBO TBEX TB UTS AVG
100%

80%

60%

40%

20%

0%
DD
DD

DD
DH

DH
DD+RO

DD+RO

DD+RO
GD
GH

GD
GH

GD
GH
Energy trends similar to execution time
19
Conclusions

• Emerging heterogeneous apps use fine-grained synch

– GPU coherence + DRF: inefficient, but simple memory model
– GPU coherence + HRF: efficient, but complex memory model
Do GPU models (HRF) need to be more complex than CPU models (DRF)?
– DeNovo + DRF: efficient AND simple memory model

complex
consistency
models!
20

Advanced Performance Optimization in CUDA (S62192)
100% (1)
Advanced Performance Optimization in CUDA (S62192)
127 pages
TLPGNN and Qagnn
No ratings yet
TLPGNN and Qagnn
17 pages
CUDA 9 Features and Tesla V100 Overview
100% (1)
CUDA 9 Features and Tesla V100 Overview
45 pages
HLS-Based Acceleration Framework For Deep Convolutional Neural Networks
No ratings yet
HLS-Based Acceleration Framework For Deep Convolutional Neural Networks
11 pages
Advanced CUDA Techniques for Many-Core Computing
No ratings yet
Advanced CUDA Techniques for Many-Core Computing
64 pages
CUDA For Tegra AppNote
No ratings yet
CUDA For Tegra AppNote
60 pages
Optimizing GPU Performance Metrics
No ratings yet
Optimizing GPU Performance Metrics
9 pages
GPU Graph Algorithms with CUDA
No ratings yet
GPU Graph Algorithms with CUDA
26 pages
GPU Computing With Apache Spark and Python: April 5, 2016
No ratings yet
GPU Computing With Apache Spark and Python: April 5, 2016
55 pages
Optimizing CUDA Memory Hierarchy
No ratings yet
Optimizing CUDA Memory Hierarchy
77 pages
NVIDIA - Cooperative Groups - Slides - GTC 2017 (S7622-Kyrylo-Perelygin-Robust-And-Scalable-Cuda)
No ratings yet
NVIDIA - Cooperative Groups - Slides - GTC 2017 (S7622-Kyrylo-Perelygin-Robust-And-Scalable-Cuda)
56 pages
s7122 Stephen Jones Cuda Optimization Tips Tricks and Techniques
No ratings yet
s7122 Stephen Jones Cuda Optimization Tips Tricks and Techniques
71 pages
EE292A Lecture 2.ML - Hardware
No ratings yet
EE292A Lecture 2.ML - Hardware
61 pages
CSED405 Lec5-Threads and Atomics - 240921 - 193053
No ratings yet
CSED405 Lec5-Threads and Atomics - 240921 - 193053
34 pages
Levental Uchicago 0330D 17419
No ratings yet
Levental Uchicago 0330D 17419
163 pages
Heterogeneous CPU-GPU Computing Insights
No ratings yet
Heterogeneous CPU-GPU Computing Insights
83 pages
GPGPU Tutorial
No ratings yet
GPGPU Tutorial
155 pages
Energy-Efficient DNN Accelerator Design
No ratings yet
Energy-Efficient DNN Accelerator Design
147 pages
Lattice2024 Wagner
No ratings yet
Lattice2024 Wagner
38 pages
Christopher Noel Hesse
No ratings yet
Christopher Noel Hesse
103 pages
IEEE - HiPC 2023
No ratings yet
IEEE - HiPC 2023
2 pages
高通Opencl
No ratings yet
高通Opencl
116 pages
Synthesis Gpgpu Draft2012 09
No ratings yet
Synthesis Gpgpu Draft2012 09
100 pages
GPU Verification Iccad18-Gpu
No ratings yet
GPU Verification Iccad18-Gpu
8 pages
Analysis and Comparison of Performance and Power Consumption of Neural Networks On CPU, GPU, TPU and FPGA - Christopher - Noel - Hesse
No ratings yet
Analysis and Comparison of Performance and Power Consumption of Neural Networks On CPU, GPU, TPU and FPGA - Christopher - Noel - Hesse
103 pages
11 Lecture 25 02 25
No ratings yet
11 Lecture 25 02 25
42 pages
Efficient FPGA CNN Inference System
No ratings yet
Efficient FPGA CNN Inference System
4 pages
Deep Learning Frameworks & Techniques
No ratings yet
Deep Learning Frameworks & Techniques
5 pages
Heterogeneous System Coherence For Integrated CPU-GPU Systems
No ratings yet
Heterogeneous System Coherence For Integrated CPU-GPU Systems
11 pages
AI for Forest Fire Detection
No ratings yet
AI for Forest Fire Detection
11 pages
Intro GPUs
No ratings yet
Intro GPUs
36 pages
Data in Memory Analytics Framework
No ratings yet
Data in Memory Analytics Framework
27 pages
The Design and Implementation of A Verification Technique For GPU Kernels
No ratings yet
The Design and Implementation of A Verification Technique For GPU Kernels
49 pages
Tesi
No ratings yet
Tesi
73 pages
Pytorch Performance Tuning Guide: Szymon Migacz, 04/12/2021
No ratings yet
Pytorch Performance Tuning Guide: Szymon Migacz, 04/12/2021
20 pages
GPU Versus FPGA For High Productivity Computing: Imperial College London, Electrical and Electronic Engineering, London
No ratings yet
GPU Versus FPGA For High Productivity Computing: Imperial College London, Electrical and Electronic Engineering, London
6 pages
An Efficient Hardware Architecture For Exploiting Sparsity in Neural Networks Master Thesis
No ratings yet
An Efficient Hardware Architecture For Exploiting Sparsity in Neural Networks Master Thesis
63 pages
Article Report Final
No ratings yet
Article Report Final
9 pages
Low-Cost MPSoC for Real-Time Object Tracking
No ratings yet
Low-Cost MPSoC for Real-Time Object Tracking
2 pages
5th AccML Paper 1
No ratings yet
5th AccML Paper 1
6 pages
Cortex 2 0 Whitepaper EN
No ratings yet
Cortex 2 0 Whitepaper EN
60 pages
Serving Large Language Models on Huawei CloudMatrix384（华为GPU+DEEPSEEK）
No ratings yet
Serving Large Language Models on Huawei CloudMatrix384（华为GPU+DEEPSEEK）
59 pages
Deep Learning: Data vs. Model Parallelism
No ratings yet
Deep Learning: Data vs. Model Parallelism
5 pages
DS1822 - Parallel Computing-Unit3
No ratings yet
DS1822 - Parallel Computing-Unit3
17 pages
Laius: An 8-Bit Fixed-Point CNN Hardware Inference Engine
No ratings yet
Laius: An 8-Bit Fixed-Point CNN Hardware Inference Engine
8 pages
Energy-Efficient FPGA CNN Accelerator
No ratings yet
Energy-Efficient FPGA CNN Accelerator
4 pages
GPU-Based Parallel Metaheuristics
No ratings yet
GPU-Based Parallel Metaheuristics
81 pages
Deterministic ML in Molecular Dynamics
No ratings yet
Deterministic ML in Molecular Dynamics
68 pages
Jlpea 12 00011 v3
No ratings yet
Jlpea 12 00011 v3
16 pages
Serving Large Language Models On Huawei Cloudmatrix384
No ratings yet
Serving Large Language Models On Huawei Cloudmatrix384
58 pages
Lecture 11 Programming On Gpus Part 1 Zxu2acms60212 40212 S15lec 11 Gpupdf
No ratings yet
Lecture 11 Programming On Gpus Part 1 Zxu2acms60212 40212 S15lec 11 Gpupdf
121 pages
GPU Insights for CPU Experts
100% (1)
GPU Insights for CPU Experts
70 pages
Set A
No ratings yet
Set A
20 pages
FPGA-Based CNN Prototyping Framework
No ratings yet
FPGA-Based CNN Prototyping Framework
65 pages
OpenMP 4.0 Code Generation for OmpSs
No ratings yet
OpenMP 4.0 Code Generation for OmpSs
74 pages
Owens
No ratings yet
Owens
67 pages
Understanding Cache Coherence in Parallel Architectures
No ratings yet
Understanding Cache Coherence in Parallel Architectures
74 pages
Lecture 9 Aneamia and Cardiovascular Disorders 2016
No ratings yet
Lecture 9 Aneamia and Cardiovascular Disorders 2016
33 pages
Lec 2 Ans 2nd Semester 2015
No ratings yet
Lec 2 Ans 2nd Semester 2015
40 pages
Jensen Student Chapter 1
No ratings yet
Jensen Student Chapter 1
40 pages
Second Summary PDF
No ratings yet
Second Summary PDF
8 pages
Week 9 PDF
No ratings yet
Week 9 PDF
9 pages
Week 10 PDF
No ratings yet
Week 10 PDF
5 pages
Lecture 5 Psychological and Physiological Changes During Pregnancy 1
No ratings yet
Lecture 5 Psychological and Physiological Changes During Pregnancy 1
64 pages
Lecture 4 Conception Fertilization and Fetal Development 1
No ratings yet
Lecture 4 Conception Fertilization and Fetal Development 1
73 pages
Lecture 1 Maternal Healthand Reproductive Health Concepts
No ratings yet
Lecture 1 Maternal Healthand Reproductive Health Concepts
25 pages
Lecture 13 Nursing Care of A Family Experiencing A Sudden Pregnancy Complication 2
No ratings yet
Lecture 13 Nursing Care of A Family Experiencing A Sudden Pregnancy Complication 2
109 pages
Lecture 7 High Risk Bleeding 1
No ratings yet
Lecture 7 High Risk Bleeding 1
80 pages
New Born Lab Fay 2021
No ratings yet
New Born Lab Fay 2021
36 pages
Professional Relationship Issues
No ratings yet
Professional Relationship Issues
27 pages
Lectures 16 Infertility
No ratings yet
Lectures 16 Infertility
47 pages
Ethical Principles 1
No ratings yet
Ethical Principles 1
45 pages
Professional Issues
No ratings yet
Professional Issues
43 pages
Practice Issues Related To PT Self Determination CH 11 1 1
No ratings yet
Practice Issues Related To PT Self Determination CH 11 1 1
35 pages
Spinal Cord Injury 2020 2021 2
No ratings yet
Spinal Cord Injury 2020 2021 2
53 pages
Practice Issues Related To Technology
No ratings yet
Practice Issues Related To Technology
36 pages
Social Issues
No ratings yet
Social Issues
17 pages
2024.05.09 HRF GloMag Sanctions Slides
No ratings yet
2024.05.09 HRF GloMag Sanctions Slides
20 pages
Guidelines For Precast Concrete Grease Interceptors
No ratings yet
Guidelines For Precast Concrete Grease Interceptors
48 pages
HPC Laser Engraver Safety Manual
No ratings yet
HPC Laser Engraver Safety Manual
11 pages
Fit-Up and Welding Inspection Report
No ratings yet
Fit-Up and Welding Inspection Report
2 pages
History of Millwork Techniques
No ratings yet
History of Millwork Techniques
26 pages
Calculus I Lecture Notes on Concavity
No ratings yet
Calculus I Lecture Notes on Concavity
2 pages
Sci6 Q1 Wks1-3 Mod1 Homgeneous-vs-Heterogeneous-Mixtures Matias (Uploaded)
No ratings yet
Sci6 Q1 Wks1-3 Mod1 Homgeneous-vs-Heterogeneous-Mixtures Matias (Uploaded)
21 pages
Whether Entering A Cave For Exploration or For Res
No ratings yet
Whether Entering A Cave For Exploration or For Res
1 page
F1-17 Orifice & Free Jet Flow Lab Manual
No ratings yet
F1-17 Orifice & Free Jet Flow Lab Manual
29 pages
ECHA Note: Sterile Controls in Biodegradation Studies - Current Status in Regulatory Testing in Persistence Assessment Under REACH
No ratings yet
ECHA Note: Sterile Controls in Biodegradation Studies - Current Status in Regulatory Testing in Persistence Assessment Under REACH
58 pages
Book of Romans Background
91% (22)
Book of Romans Background
15 pages
Cement Quality Control & Assurance Guide
No ratings yet
Cement Quality Control & Assurance Guide
16 pages
Singer 3116 Simple Instruction Manual 121397
No ratings yet
Singer 3116 Simple Instruction Manual 121397
45 pages
Geometry Part 1
No ratings yet
Geometry Part 1
14 pages
Wa0005
No ratings yet
Wa0005
24 pages
Gruenberg Depyrogenation Oven PDF
No ratings yet
Gruenberg Depyrogenation Oven PDF
4 pages
BL Epss Lagos
No ratings yet
BL Epss Lagos
187 pages
Cerulean Seas, Celadon Shores
100% (5)
Cerulean Seas, Celadon Shores
138 pages
Advancements in Magnesium Alloys For Lightweight and High-Performance Applications
No ratings yet
Advancements in Magnesium Alloys For Lightweight and High-Performance Applications
5 pages
Motor Starter Interview Questions English
No ratings yet
Motor Starter Interview Questions English
3 pages
BMW 1 Series - Pdf.asset.1618548287287 PDF
No ratings yet
BMW 1 Series - Pdf.asset.1618548287287 PDF
15 pages
Reaction Engineering Lab Report Guide
100% (1)
Reaction Engineering Lab Report Guide
46 pages
Allied - 3265 - Microbiology - Mittp (May-2024) - May-2024 (Apr-24)
No ratings yet
Allied - 3265 - Microbiology - Mittp (May-2024) - May-2024 (Apr-24)
2 pages
PD Cen Iso TR 26369-2009
No ratings yet
PD Cen Iso TR 26369-2009
46 pages
2019 Laser Measuring Tools Catalog
No ratings yet
2019 Laser Measuring Tools Catalog
144 pages
IJPRABriefreviewon Total Quality Managementin Pharmaceutical Industries
No ratings yet
IJPRABriefreviewon Total Quality Managementin Pharmaceutical Industries
9 pages
Biology Quiz: Molecule Movement & Organisms
No ratings yet
Biology Quiz: Molecule Movement & Organisms
4 pages
CH 05 Conservation of Ecosystem
No ratings yet
CH 05 Conservation of Ecosystem
9 pages
Cereals, Pulses, Legumes and Vegetable Proteins
No ratings yet
Cereals, Pulses, Legumes and Vegetable Proteins
116 pages
Polygon
No ratings yet
Polygon
25 pages
BTEC HND Networking Assignment Guide
No ratings yet
BTEC HND Networking Assignment Guide
74 pages

Efficient GPU Synchronization Without Scopes: Saying No To Complex Consistency Models

Uploaded by

Efficient GPU Synchronization Without Scopes: Saying No To Complex Consistency Models

Uploaded by

Efficient GPU Synchronization without Scopes:

Saying No to Complex Consistency Models

Matthew D. Sinclair, Johnathan Alsop, Sarita V. Adve

Heterogeneous systems now used for a wide variety of applications

BUT current GPUs have sub-optimal consistency and coherence

This work: simple consistency + efficient coherence

• Read hit: Don’t return stale data

• With data-race-free (DRF) memory model

• But higher programming complexity

• With data-race-free (DRF) memory model

Coherence + Consistency Reuse Data Do Synchs

DeNovo + HRF (DH)  local 

• 1 CPU core + 15 GPU compute units (CU)

FAM SLM SPM SPMBO AVG

N/W L2 $ L1 D$ Scratch GPU Core+

N/W L2 $ L1 D$ Scratch GPU Core+

• Emerging heterogeneous apps use fine-grained synch

You might also like