
CHW 461:

Parallel Architectures
Lecture # 4
GPU System Context
GPU Computing?
 Design target for CPUs:
 Make a single thread very fast
 Take control away from the programmer

 GPU computing takes a different approach:
 Throughput matters; single threads do not
 Give explicit control to the programmer
"CPU-style" Cores
Slimming down
More Space: Double the Number of Cores
Saving Yet More Space

Idea #2
 Amortize the cost and complexity of managing an instruction stream across many ALUs.

→ SIMD
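As a concrete sketch (a minimal illustration, not how any particular ISA spells it), the C loop below amortizes one instruction stream over several lanes; the inner lane loop is what SIMD hardware executes in lockstep from a single decoded instruction:

```c
#include <stddef.h>

#define LANES 4  /* hypothetical SIMD width: 4 floats per instruction */

/* One instruction stream (the outer loop) drives LANES ALUs at a time.
   On real SIMD hardware the inner lane loop runs in lockstep as a single
   vector instruction. Assumes n is a multiple of LANES. */
void vec_add(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            c[i + l] = a[i + l] + b[i + l];
}
```

Compilers routinely auto-vectorize loops of exactly this shape into SSE/AVX (or GPU) vector instructions.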
Gratuitous Amounts of Parallelism!
Branches
Memory
 Memory latency: The time taken for a memory request to complete. This usually takes hundreds of cycles.

 Memory bandwidth: The rate at which the memory system can deliver data to a processor.

 Stalling: Occurs when a processor cannot continue to execute code because the current instruction depends on a previous instruction that has not yet completed. The processor must wait until that instruction finishes. Stalls commonly occur on memory loads.
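A minimal C sketch of why loads stall a processor: in a pointer chase, each load's address depends on the previous load's result, so nothing can overlap the memory latency, and every step pays it in full (the chain below is illustrative).

```c
/* next[i] holds the index of the element to visit after i. Each load's
   address depends on the previous load's result, so an in-order core must
   stall for the full memory latency on every step. By contrast, the loads
   in a plain sum over a[0..n-1] are independent and can be overlapped. */
int chase(const int *next, int start, int steps) {
    int idx = start;
    for (int s = 0; s < steps; s++)
        idx = next[idx];  /* dependent load: cannot be issued early */
    return idx;
}
```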
Remaining Problem: Slow Memory
 Problem
 Memory still has very high latency...
...but we've removed most of the hardware that helps us deal with that.

We've removed
 caches
 branch prediction
 out-of-order execution
So what now?
Hiding Memory Latency
Discussion !!
 Does multi-threading increase or decrease the time for an individual thread to finish its assigned task?

 Does multi-threading improve throughput?

 Does multi-threading improve performance?

 Does the time to complete all the tasks increase or decrease? Why?

 Why does multi-threading require a lot of memory bandwidth? Explain.
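One way to reason about these questions is a toy cycle-accounting model in C (all parameters illustrative): each thread issues one compute cycle, then waits `latency` cycles for a load; the core switches to any other ready thread instead of idling.

```c
/* Toy model: `threads` threads (at most 32) each issue `work` operations;
   every operation is one compute cycle followed by a load of `latency`
   cycles. The core issues from any ready thread each cycle (zero-cost
   switch) and idles only when every thread is waiting on a load. */
long simulate(int threads, int work, int latency) {
    long ready_at[32] = {0};   /* cycle at which each thread may issue again */
    int remaining[32];
    for (int t = 0; t < threads; t++) remaining[t] = work;
    int left = threads * work; /* operations still to issue */
    long cycle = 0;
    while (left > 0) {
        int issued = 0;
        for (int t = 0; t < threads; t++) {
            if (remaining[t] > 0 && ready_at[t] <= cycle) {
                remaining[t]--;
                left--;
                cycle++;                       /* one compute cycle */
                ready_at[t] = cycle + latency; /* its load is now in flight */
                issued = 1;
                break;
            }
        }
        if (!issued) cycle++;  /* all threads stalled: the core idles */
    }
    return cycle;
}
```

In this model, more threads shorten the total time by keeping the core busy during loads; each extra in-flight load is also an extra concurrent demand on memory bandwidth.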


GPU Architecture Summary
 Core Ideas:
1. Many slimmed-down cores
→ lots of parallelism.

2. More ALUs, fewer control units.

3. Avoid memory stalls by interleaving execution of SIMD groups.


Two Main Goals
• Maintain execution speed of old sequential programs → CPU

• Increase throughput of parallel programs → GPU

The CPU is optimized for sequential code performance; GPU memory offers almost 10x the bandwidth of a multicore CPU (with a relaxed memory model).
A Quick Glimpse on Flynn Classification
• A taxonomy of computer architectures.

• Proposed by Michael Flynn in 1966.

• It is based on two things:
– Instructions
– Data

• Crossing single/multiple instruction streams with single/multiple data streams gives four classes: SISD, SIMD, MISD, and MIMD.
Which one is closest to the GPU?
Problems Faced by GPUs
• Need enough parallelism.

• Under-utilization.

• Bandwidth to CPU
Modern GPU Hardware
 GPUs have
 many parallel execution units and
 higher transistor counts,
 while CPUs have
 few execution units and
 higher clock speeds

• GPUs have much deeper pipelines than the 10-20 stages typical of CPUs
• GPUs have significantly faster and more advanced memory interfaces as
they need to shift around a lot more data than CPUs
Let’s Take A Closer Look:
The Hardware
GPU Architecture: GeForce 8800 (2007)

➢ Each SM is capable of supporting thousands of concurrent hardware threads, up to 2048 on modern architecture GPUs.

➢ The SM performs all the thread management, including creation, scheduling, and barrier synchronization.

➢ The SM employs a SIMT (Single Instruction, Multiple Thread) architecture to efficiently manage the large number of threads that exist.
Streaming Processor (SP) and Streaming Multiprocessor (SM):
• SPs within an SM share control logic and an instruction cache.
• GPU memory has much higher bandwidth than typical system memory, but is a bit slower.
• Communication between GPU memory and system memory is slow.
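The SIMT execution model above can be sketched in plain C (a toy model, not real hardware behavior): all lanes of a warp share one instruction stream, so on a divergent branch the hardware runs each side in turn with the non-participating lanes masked off.

```c
#define WARP 8  /* hypothetical warp width */

/* Scalar reference: what each lane should compute independently. */
static int scalar_op(int v) {
    return (v % 2 == 0) ? v + 1 : v * 2;
}

/* SIMT-style execution: one instruction stream, per-lane active masks.
   Both sides of the branch are issued; each lane is active on exactly
   one of the two passes. */
void simt_branch(int v[WARP]) {
    int take_if[WARP];
    for (int l = 0; l < WARP; l++)  /* evaluate the condition per lane */
        take_if[l] = (v[l] % 2 == 0);
    for (int l = 0; l < WARP; l++)  /* pass 1: "if" side, others masked */
        if (take_if[l]) v[l] = v[l] + 1;
    for (int l = 0; l < WARP; l++)  /* pass 2: "else" side, others masked */
        if (!take_if[l]) v[l] = v[l] * 2;
}
```

Because both passes are issued, a fully divergent warp pays roughly the cost of both branch paths, which is why branch divergence reduces SIMT throughput.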
Scalar vs Threaded
Scalar program
float A[4][8];
for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 8; j++) {
        A[i][j]++;
    }
}
Multithreaded: (4x1) blocks – (8x1) threads
Multithreaded: (2x2) blocks – (4x2) threads
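A hedged CUDA sketch of the first multithreaded variant, a (4x1) grid of (8x1) blocks with one thread per element of A; the kernel name and launch line are illustrative, and `d_A` is assumed to be a device copy of A stored row-major.

```cuda
// One thread per element: blockIdx.x selects the row (4 blocks),
// threadIdx.x selects the column (8 threads per block).
__global__ void incrementAll(float *A) {
    int i = blockIdx.x;   // row index, 0..3
    int j = threadIdx.x;  // column index, 0..7
    A[i * 8 + j]++;       // A stored row-major as a flat 4*8 array
}

// Illustrative launch:
// incrementAll<<<dim3(4, 1), dim3(8, 1)>>>(d_A);
```

The (2x2)-blocks, (4x2)-threads variant covers the same 32 elements with a 2-D grid; only the index arithmetic changes.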
