Lahore Garrison University


Parallel and Distributed
Computing
Session Fall-2020

Lecture – 05 Week – 03
INSTRUCTOR: MUHAMMAD ARSALAN RAZA
NAME: MUHAMMAD ARSALAN RAZA
EMAIL: MARSALANRAZA@LGU.EDU.PK
PHONE NUMBER: 0333-7843180 (PRIMARY), 0321-4069289
VISITING HOURS: MON-FRIDAY: 12:30PM-3:00PM; THURSDAY: 9:30PM-2:30PM
Preamble

 Fault Tolerance

 Types

 Faulty Systems


Lesson Plan

 GPU Architectures
 Conventional CPU architecture
 Modern GPGPU architectures
 AMD Southern Islands GPU Architecture
 Nvidia Fermi GPU Architecture
 Cell Broadband Engine
Conventional CPU Architecture

 CPUs are optimized to minimize the latency of a single thread.
 Can efficiently handle control-flow-intensive workloads.
 Much of the die area is devoted to caching and control logic.
 Multi-level caches are used to hide memory latency.
 Limited number of registers, since fewer threads are active.
 Control logic reorders execution, provides Instruction-Level Parallelism (ILP), and minimizes pipeline stalls.
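As a quick numeric illustration of how multi-level caches hide latency, the classic average memory access time (AMAT) formula can be evaluated. The latencies and miss rates below are illustrative classroom numbers, not figures from this lecture:

```python
# Sketch: average memory access time (AMAT) with a two-level cache.
# All latencies (cycles) and miss rates are illustrative examples.
l1_hit, l2_hit, dram = 4, 12, 200        # access latencies in cycles
l1_miss_rate, l2_miss_rate = 0.05, 0.20  # fraction of accesses that miss

# Each level's penalty is paid only by the accesses that miss above it.
amat = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * dram)
print(amat)  # 6.6 cycles on average, versus 200 for uncached DRAM
```

Even with modest hit rates, the effective latency collapses from hundreds of cycles to single digits, which is why CPUs spend so much area on caches.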
Modern GPGPU Architecture

 Array of independent "cores" called Compute Units.
 High-bandwidth, banked L2 caches and main memory.
 Banks allow multiple accesses to occur in parallel.
 Bandwidths of hundreds of GB/s.
 Memory and caches are generally non-coherent.
 Compute units are based on SIMD hardware.
 Both AMD and NVIDIA use 16-lane-wide SIMD units.
 Large register files are used for fast context switching.
 No saving/restoring of state.
 Data is persistent for the entire thread execution.
 Both vendors combine an automatic L1 cache with a user-managed scratchpad.
 The scratchpad is heavily banked and very high bandwidth (~terabytes/second).
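To see why banking matters, the sketch below maps scratchpad accesses to banks and counts worst-case conflicts. The 32-bank, 4-byte-word configuration is a typical assumption, not a figure from the slides:

```python
# Sketch: map scratchpad accesses to banks and measure serialization.
# Assumes 32 banks interleaved at 4-byte-word granularity (typical values).
NUM_BANKS = 32

def bank_of(byte_address):
    """Bank serving the 4-byte word at this byte address."""
    return (byte_address // 4) % NUM_BANKS

def max_conflict(addresses):
    """Worst-case serialization: the most accesses landing on one bank."""
    counts = {}
    for a in addresses:
        b = bank_of(a)
        counts[b] = counts.get(b, 0) + 1
    return max(counts.values())

# Unit-stride access: every lane hits a different bank -> fully parallel.
print(max_conflict([4 * lane for lane in range(32)]))      # 1
# Stride of 32 words: every lane hits bank 0 -> fully serialized.
print(max_conflict([128 * lane for lane in range(32)]))    # 32
```

A conflict count of 1 means all accesses complete in parallel; a count of 32 means the hardware must serialize them, losing the scratchpad's bandwidth advantage.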

Modern GPU Architecture

 Work-items are automatically grouped into hardware threads called "wavefronts" (AMD) or "warps" (NVIDIA).
 A single instruction stream is executed on SIMD hardware.
 64 work-items in a wavefront, 32 in a warp.
 Each instruction is issued multiple times on a 16-lane SIMD unit.
 Control flow is handled by masking SIMD lanes.
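A minimal sketch of lane masking: when a branch diverges, the SIMD unit executes both paths for every lane, and a per-lane mask decides which result each lane keeps. The function names below are illustrative, not a vendor API:

```python
# Sketch: SIMD control flow via lane masking (illustrative, not vendor API).
def simd_select(values, condition, then_fn, else_fn):
    """Run both branch paths across all lanes; the mask picks the result
    each lane commits. Divergent branches therefore pay for both paths."""
    mask = [condition(v) for v in values]
    then_results = [then_fn(v) for v in values]   # issued for every lane
    else_results = [else_fn(v) for v in values]   # issued for every lane
    return [t if m else e for m, t, e in zip(mask, then_results, else_results)]

# 8 lanes: even values are doubled, odd values are negated.
lanes = list(range(8))
out = simd_select(lanes, lambda v: v % 2 == 0, lambda v: v * 2, lambda v: -v)
print(out)  # [0, -1, 4, -3, 8, -5, 12, -7]
```

This is why heavily divergent control flow is expensive on GPUs: both sides of every branch occupy the SIMD unit, even for lanes whose results are masked off.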

AMD Southern Islands

 7000-series GPUs based on the Graphics Core Next (GCN) architecture.
 4 SIMDs per compute unit.
 1 scalar unit handles instructions common to the whole wavefront:
 Loop iterators, constant-variable accesses, branches.
 Has a single, integer-only ALU.
 A separate branch unit is used for some conditional instructions.
 Radeon HD 7970:
 32 compute units.
 Max performance.
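The peak performance can be reconstructed as a back-of-envelope calculation from the counts above. The 925 MHz reference clock and 2 FLOPs per lane per cycle (one fused multiply-add) are public spec-sheet figures, not values from the slides:

```python
# Back-of-envelope peak single-precision throughput for the Radeon HD 7970.
# Clock (925 MHz) and FMA rate (2 FLOPs/cycle) are public reference figures.
compute_units = 32
simds_per_cu = 4
lanes_per_simd = 16
clock_hz = 925e6
flops_per_lane_per_cycle = 2  # one fused multiply-add = multiply + add

alus = compute_units * simds_per_cu * lanes_per_simd
peak_tflops = alus * clock_hz * flops_per_lane_per_cycle / 1e12
print(alus, round(peak_tflops, 2))  # 2048 3.79
```

So 32 CUs x 4 SIMDs x 16 lanes gives 2048 ALUs, for roughly 3.79 TFLOPS single precision at the reference clock.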

AMD Southern Islands Architecture

[Architecture diagram]
NVIDIA Fermi Architecture

 GTX 480: Compute Capability 2.0.
 15 cores, or Streaming Multiprocessors (SMs).
 Each SM features 32 CUDA processors.
 480 CUDA processors in total.
 Global memory with ECC.
 An SM executes threads in groups of 32 called warps.
 Two warp issue units per SM.
 Concurrent kernel execution: multiple kernels execute simultaneously to improve efficiency.
 A CUDA core consists of a single ALU and a floating-point unit (FPU).
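The same back-of-envelope arithmetic applies to the Fermi-based GTX 480. The 1401 MHz shader clock and 2 FLOPs per core per cycle (fused multiply-add) are public reference figures, not values from the slides:

```python
# Back-of-envelope figures for the Fermi-based GTX 480.
# Shader clock (1401 MHz) and FMA rate (2 FLOPs/cycle) are public specs.
sms = 15
cuda_cores_per_sm = 32
shader_clock_hz = 1401e6
flops_per_core_per_cycle = 2  # one fused multiply-add per cycle

cuda_cores = sms * cuda_cores_per_sm
peak_tflops = cuda_cores * shader_clock_hz * flops_per_core_per_cycle / 1e12
print(cuda_cores, round(peak_tflops, 2))  # 480 1.34
```

This confirms the 15 x 32 = 480 CUDA cores stated above, for roughly 1.34 TFLOPS peak single precision.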

NVIDIA Fermi Architecture

[Architecture diagram]
Cell Broadband Engine

 Developed by Sony, Toshiba, and IBM.
 Transitioned from embedded platforms into HPC via the PlayStation 3.
 OpenCL drivers are available for Cell BladeCenter servers.
 Consists of one Power Processing Element (PPE) and multiple Synergistic Processing Elements (SPEs).
 Uses the IBM XL C for OpenCL compiler.


Cell Broadband Engine Architecture

[Architecture diagram]
Lesson Review

 GPU Architectures
 Conventional CPU architecture
 Modern GPGPU architectures
 AMD Southern Islands GPU Architecture
 Nvidia Fermi GPU Architecture
 Cell Broadband Engine

Next Lesson Preview

 GPU Programming
 Maryland CPU/GPU Cluster Architecture
 Intel’s response to NVIDIA GPUs
 To accelerate or not to accelerate
 When are GPUs appropriate?
 CPU vs. GPU hardware design philosophies
 CUDA-capable GPU hardware architecture
 Is it hard to program on a GPU?
 How to program on a GPU today?

References

 To cover this topic, different reference materials have been consulted.

 Textbooks:

Distributed Systems: Principles and Paradigms, A. S. Tanenbaum and M. van Steen, Prentice Hall, 2nd Edition, 2007.

Distributed and Cloud Computing: Clusters, Grids, Clouds, and the Future Internet, K. Hwang, J. Dongarra and G. C. Fox, Elsevier, 1st Edition.

 Google Search Engine
