
Performance Optimization for Software Developers
with oneAPI & Parallelization on Intel® Architecture

Edmund Preiss
Business Development Manager
Software and Advanced Technology Group (SATG)
NOTICES AND DISCLAIMERS

Refer to https://software.intel.com/en-us/articles/optimization-notice for more information regarding performance and optimization choices in Intel
software products.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance
varies depending on system configuration. No product or component can be absolutely secure. Check with your system manufacturer or retailer or learn
more at [intel.com].

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and
roadmaps.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications.
Current characterized errata are available on request.
Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.
Other names and brands may be claimed as the property of others.
© Intel Corporation

SATG 2
Welcome to the Parallelism Era!

Source: © 2014, James Reinders, Intel, used with permission

[Figure: Microprocessor frequency over time (history) versus power consumption]

… performance is no longer only the computer architect's job
… performance increase is increasingly the job of the software developer

SATG 4
Amdahl’s Law
“The speedup of a program
using multiple processors
in parallel computing is limited
by the sequential fraction of
the program.”

- Gene Amdahl

Speedup = (s + p) / (s + p/N) = 1 / (s + p/N)

where s is the sequential fraction, p = 1 - s the parallel fraction, and N the number of processors.
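A quick way to build intuition for the formula is to evaluate it for a few processor counts. The sketch below does that in C++ for an assumed serial fraction of s = 0.1 (an illustrative value, not taken from the slide):

#include <cstdio>

// Amdahl's Law: with serial fraction s and parallel fraction p = 1 - s,
// the speedup on N processors is 1 / (s + p / N).
double amdahl_speedup(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main() {
    const double s = 0.1;  // assumed: 10% of the runtime stays sequential
    const int counts[] = {1, 2, 4, 8, 16, 64, 1024};
    for (int n : counts)
        std::printf("N = %4d  ->  speedup = %5.2fx\n", n, amdahl_speedup(s, n));
    // Even with 1024 processors the speedup stays below the 1/s = 10x limit.
    return 0;
}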

SATG 5
The ‘Free Lunch’ is over

“Parallelism ==> Performance”

(which leads to)

Optimization – it is up to the developer to make sure the statement above becomes reality!

SATG 6
Parallelism on the Intel Architecture

Developer

Tools

Architecture

SATG 7
Code Optimization Principles
Stage 1: Use Optimized Libraries

Stage 2: Compile with Architecture-specific Optimizations

Stage 3: Analysis and Tuning

Stage 4: Check Correctness

SATG 8
oneAPI – A Tools Development Framework

▪ Open Industry Specification
▪ Intel Product

SATG 9
Programming Challenges for Multiple Architectures

▪ Growth in specialized workloads
▪ Variety of data-centric hardware required
▪ Requires separate programming models and toolchains for each architecture
▪ Software development complexity limits freedom of architectural choice

[Diagram: application workloads need diverse hardware – scalar, vector, spatial, matrix; middleware & frameworks sit on top of CPU, GPU, FPGA and other accelerators (XPUs), each with its own programming model]

SATG 10
Introducing oneAPI

▪ Cross-architecture programming that delivers freedom to choose the best hardware
▪ Based on industry standards and open specifications
▪ Exposes cutting-edge performance features of the latest hardware
▪ Compatible with existing high-performance languages and programming models, including C++, OpenMP, Fortran, and MPI

[Diagram: application workloads need diverse hardware – scalar, vector, spatial, matrix; middleware & frameworks on top of the oneAPI industry initiative and Intel product, spanning CPU, GPU, FPGA and other accelerators (XPUs)]

SATG 11
oneAPI Industry Initiative
Break the Chains of Proprietary Lock-in

▪ A cross-architecture language based on C++ and SYCL standards
▪ Powerful libraries designed for acceleration of domain-specific functions
▪ Low-level hardware abstraction layer
▪ Open to promote community and industry collaboration
▪ Enables code reuse across architectures and vendors

The productive, smart path to freedom for accelerated computing from the economic and technical burdens of proprietary programming models.

[Diagram: oneAPI Industry Specification – direct programming (Data Parallel C++) and API-based programming libraries (math, threading, analytics/ML, DNN, ML comm, video processing) on a low-level hardware interface spanning CPU, GPU, FPGA and other accelerators (XPUs)]

Visit oneapi.com for more details

SATG 12
Data Parallel C++
Standards-based, Cross-architecture Language
DPC++ = ISO C++ and Khronos SYCL

Parallelism, productivity, and performance for CPUs and accelerators
▪ Delivers accelerated computing by exposing hardware features
▪ Allows code reuse across hardware targets, while permitting custom tuning for specific accelerators
▪ Provides an open, cross-industry solution to single-architecture proprietary lock-in

Based on C++ and SYCL
▪ Delivers C++ productivity benefits, using common, familiar C and C++ constructs
▪ Incorporates SYCL from the Khronos Group to support data parallelism and heterogeneous programming

Community project to drive language enhancements
▪ Provides extensions to simplify data parallel programming
▪ Continues evolution through open and cooperative development

Apply your skills to the next innovation, not to rewriting software for the next hardware platform.

[Diagram: Direct Programming – Data Parallel C++ = ISO C++ + Khronos SYCL + community extensions]

The open source and Intel DPC++/C++ compilers support Intel CPUs, GPUs, and FPGAs.
Codeplay announced a DPC++ compiler that targets Nvidia GPUs.

SATG 13
Powerful oneAPI Libraries

▪ Designed for acceleration of key domain-specific functions
▪ Pre-optimized for each target platform for maximum performance

Intel® oneAPI Math Kernel Library (oneMKL)
Intel® oneAPI Deep Neural Network Library (oneDNN)
Intel® oneAPI Video Processing Library (oneVPL)
Intel® oneAPI Data Analytics Library (oneDAL)
Intel® oneAPI Threading Building Blocks (oneTBB)
Intel® oneAPI Collective Communications Library (oneCCL)
Intel® oneAPI DPC++ Library (oneDPL)

SATG 14
oneAPI – A Tools Development Framework

▪ Open Industry Specification
▪ Intel Product

SATG 15
Intel® oneAPI Product
Built on a Rich Heritage of Tools Based on Intel® Xeon® Processors, Now Expanded to XPUs

A complete set of advanced compilers, libraries, and porting, analysis & debugger tools

▪ Accelerates compute by exploiting cutting-edge hardware features
▪ Interoperable with existing programming models and code bases (C++, Fortran, Python, OpenMP, etc.), so developers can be confident that existing applications work seamlessly with oneAPI
▪ Eases transitions to new systems and accelerators: using a single code base frees developers to invest more time on innovation

Available Now

[Diagram: Intel® oneAPI product – compatibility tool, languages, libraries, and analysis & debug tools on a low-level hardware interface spanning CPU, GPU, FPGA (XPUs)]

Visit software.intel.com/oneapi for more details

Some capabilities may differ per architecture and custom tuning will still be required. Other accelerators to be supported in the future.

SATG 16
Intel Compiler Transition: Classic to LLVM

Start Your Migration Now
▪ Expand to XPUs
▪ Modern LLVM infrastructure

Intel® C++ Compiler Classic (CPU only) -> Intel® oneAPI DPC++/C++ Compiler (CPU, GPU, FPGA)
▪ Use for all new projects
▪ Migrate legacy projects

Intel® Fortran Compiler Classic (CPU only) -> Intel® Fortran Compiler (Beta) (CPU, GPU)
▪ Test drive now & provide feedback
▪ Prepare for migration

[Diagram: features & performance of the Classic compilers converge into the LLVM-based compilers over time; today, performance, quality, and support continue for both]


SATG 17
Intel® oneAPI Toolkits
A Complete Set of Proven Developer Tools Expanded from CPU to XPU

A core set of high-performance tools for building C++, Data Parallel C++ applications & oneAPI library-based applications

Native code developers:
▪ Intel® oneAPI Tools for HPC – deliver fast Fortran, OpenMP & MPI applications that scale
▪ Intel® oneAPI Tools for IoT – build efficient, reliable solutions that run at the network's edge
▪ Intel® oneAPI Rendering Toolkit – create performant, high-fidelity visualization applications

Data scientists & AI developers (specialized workloads):
▪ Intel® AI Analytics Toolkit – accelerate machine learning & data science pipelines with optimized DL frameworks & high-performing Python libraries
▪ Intel® Distribution of OpenVINO™ Toolkit – deploy high-performance inference & applications from edge to cloud

SATG 18
Streamlining Product Line

▪ Composer Edition & Professional Edition (CPU focused: Core & Xeon), single-node -> Intel® oneAPI Base & HPC Toolkit (single-node)
▪ Cluster Edition, multi-node -> Intel® oneAPI Base & HPC Toolkit (multi-node)
▪ Hardware targets expand from CPU (Core & Xeon) to CPU (Core & Xeon), GPU (Intel), and FPGA (Intel)

SATG 19
Intel® oneAPI Base & HPC Toolkit
Intel® oneAPI Tools for HPC: Deliver Fast Applications that Scale

What is it?
A toolkit that adds to the Intel® oneAPI Base Toolkit for building high-performance, scalable parallel code in C++, Fortran, OpenMP & MPI, from enterprise to cloud, and HPC to AI applications.

Who needs this product?
▪ OEMs/ISVs
▪ C++, Fortran, OpenMP, MPI developers

Why is this important?
▪ Accelerate performance on Intel® Xeon® & Core™ processors and accelerators
▪ Deliver fast, scalable, reliable parallel code with less effort; built on industry standards

Intel® oneAPI HPC Toolkit + Intel® oneAPI Base Toolkit

Direct Programming:
▪ Intel® C++ Compiler Classic
▪ Intel® Fortran Compiler Classic
▪ Intel® Fortran Compiler (Beta)
▪ Intel® oneAPI DPC++/C++ Compiler
▪ Intel® DPC++ Compatibility Tool
▪ Intel® Distribution for Python*
▪ Intel® FPGA Add-on for oneAPI Base Toolkit

API-Based Programming:
▪ Intel® MPI Library
▪ Intel® oneAPI DPC++ Library
▪ Intel® oneAPI Math Kernel Library
▪ Intel® oneAPI Data Analytics Library
▪ Intel® oneAPI Threading Building Blocks
▪ Intel® oneAPI Video Processing Library
▪ Intel® oneAPI Collective Communications Library
▪ Intel® oneAPI Deep Neural Network Library
▪ Intel® Integrated Performance Primitives

Analysis & Debug Tools:
▪ Intel® Inspector
▪ Intel® Trace Analyzer & Collector
▪ Intel® Cluster Checker
▪ Intel® VTune™ Profiler
▪ Intel® Advisor
▪ Intel® Distribution for GDB*

Learn More: intel.com/oneAPI-HPCKit

SATG 22
Intel® oneAPI Rendering Toolkit
Render Your Vision in Highest Fidelity

Powerful Libraries for High-Fidelity Visualization Applications
▪ Deliver high-performance, high-fidelity visualization applications on Intel® architecture
▪ Create amazing visual, hyper-realistic renderings via ray tracing with global illumination
▪ Access all system memory space to create renderings using the largest data sets
▪ Flexible, cost-efficient development using open source libraries

Intel oneAPI Rendering & Ray Tracing Libraries
▪ Intel® Embree – high-performance, feature-rich ray tracing & photorealistic rendering 1
▪ Intel® Open Image Denoise – AI-accelerated denoiser for superior visual quality 2
▪ Intel® OpenSWR – high-performance, scalable, OpenGL*-compatible rasterizer 3
▪ Intel® Open Volume Kernel Library – render & simulate 3D spatial data processing 4
▪ Intel® OSPRay – scalable, portable, distributed rendering API
▪ Intel® OSPRay Studio – real-time rendering through a graphical user interface with this new scene graph application
▪ Intel® OSPRay for Hydra – connect the Rendering Toolkit libraries to the Universal Scene Description Hydra rendering subsystem via a plugin 5

Intel® Embree, part of the Intel® oneAPI Rendering Toolkit, won an Academy Award® Technical Achievement Award in 2021.

1 Avengers: Infinity War – Digital Domain, Marvel Studios, Chaos Group V-Ray
2 Scene courtesy of Frank Meinl
3 Model from Leigh Orf at University of Wisconsin. For more tornado visualization, visit Leigh Orf's site
4 Smoke volume, data courtesy OpenVDB example repository
5 Moana Island Scene, Walt Disney Animation Studios, publicly available dataset: 15 fps+, ~160 billion prims

Learn More: intel.com/oneAPI-RenderKit

SATG 23
Intel® oneAPI AI Analytics Toolkit

Accelerate end-to-end AI and data analytics pipelines with libraries optimized for Intel® architectures

Who Uses It?
Data scientists, AI researchers, ML and DL developers, AI application developers

Top Features/Benefits
▪ Deep learning performance for training and inference with Intel-optimized DL frameworks and tools
▪ Drop-in acceleration for data analytics and machine learning workflows with compute-intensive Python packages

Deep Learning
▪ Intel® Optimization for TensorFlow
▪ Intel® Optimization for PyTorch
▪ Intel® Low Precision Optimization Tool
▪ Model Zoo for Intel® Architecture

Data Analytics & Machine Learning
▪ Accelerated data frames: Intel® Distribution of Modin, OmniSci backend
▪ Intel® Distribution for Python: XGBoost, Scikit-learn, Daal-4Py, NumPy, SciPy, Pandas
▪ Samples and end-to-end workloads

Supported hardware architectures1: CPU, GPU
1 Hardware support varies by individual tool. Architecture support will be expanded over time. Other names and brands may be claimed as the property of others.

Get the toolkit HERE or via these locations: Intel Installer, Docker, Apt, Yum, Conda, Intel® DevCloud

Learn More: software.intel.com/oneapi/ai-kit

SATG 24
Intel® Distribution of OpenVINO™ Toolkit
Powered by oneAPI

Deliver High-Performance Deep Learning Inference
A toolkit to accelerate development of high-performance deep learning inference & computer vision in vision/AI applications used from edge to cloud. It enables deep learning on hardware accelerators & easy deployment across Intel® CPUs, GPUs, FPGAs, and VPUs.

Who needs this product?
▪ Computer vision and deep learning software developers
▪ Data scientists
▪ OEMs, ISVs, system integrators

Usages
Security surveillance, robotics, retail, healthcare, AI, office automation, transportation, non-vision use cases (speech, NLP, audio, text) & more

[Diagram: Deep Learning – Intel® Deep Learning Deployment Toolkit (Model Optimizer converts & optimizes models to an Intermediate Representation (IR); Inference Engine runs optimized inference on Intel® CPUs & GPU/Intel® Processor Graphics); Open Model Zoo (40+ pretrained models, model downloader, samples); Deep Learning Workbench (calibration tool, model analyzer, benchmark app, accuracy checker, aux. capabilities); Intel® FPGA runtime environment and bitstreams (Linux* only). Traditional Computer Vision – optimized libraries & code samples (OpenCV, OpenVX, samples); tools & libraries to increase media/video/graphics performance (Intel® Media SDK open source version, OpenCL™ drivers & runtimes for GPU/Intel® Processor Graphics)]

Edge AI & Vision Alliance

SATG 25
Parallelism on the Intel Architecture
Enabled by Intel® oneAPI Products

SATG 26
The Seven Levels of Parallelism

1 Inter-Node Parallelism

2 Intra-Node Parallelism

3 Core / Thread Level Parallelism

4 Thread-Level Parallelism with SMT

5 Instruction-level Parallelism

6 Data Level Parallelism

7 Accelerator / Offloading Parallelism

SATG 27
Inter-Node Parallelism

1 Inter-Node Parallelism

SATG 29
Inter-Node Parallelism

▪ Most common scenario for HPC / cluster applications
▪ Nodes connected by a fast interconnect called a fabric
▪ Partitioned Global Address Space (PGAS) & Distributed Memory Programming
▪ Message Passing Interface (MPI) – the most common approach

SATG 30
Inter-Node Parallelism

What can I do?
▪ Employ communication-avoiding algorithms (e.g. neighborhood collectives)
▪ Overlap compute and communication where possible (see the sketch below)
▪ Load balance work between ranks
▪ In case of a well-scaling application, add more nodes ☺

Which Tools?
▪ Identify scalability issues with the VTune™ Profiler Application Performance Snapshot & the Trace Analyzer and Collector
▪ Fine-tune collective operations with the Intel MPI autotuner
▪ Where applicable:
  • Math Kernel Library (oneMKL)
  • Data Analytics Library (oneDAL)
  • Collective Communications Library (oneCCL)
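A minimal sketch of the overlap idea marked above, using non-blocking MPI calls; the ring-shaped neighbor exchange and the buffer size are illustrative assumptions, not part of the original material:

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                                // illustrative message size
    std::vector<double> halo_send(n, rank), halo_recv(n), interior(n, 1.0);
    const int right = (rank + 1) % size;
    const int left  = (rank - 1 + size) % size;

    // Start the halo exchange without blocking ...
    MPI_Request reqs[2];
    MPI_Irecv(halo_recv.data(), n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(halo_send.data(), n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    // ... and do the work that does not depend on the halo while it is in flight.
    double local = 0.0;
    for (int i = 0; i < n; ++i) local += 2.0 * interior[i];

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);            // halo data is now usable
    // ... boundary work that needs halo_recv would go here ...

    MPI_Finalize();
    return 0;
}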

SATG 31
Inter-Node Parallelism

[Screenshots: Application Performance Snapshot HTML export (left) and VTune HPC Analysis (right)]

SATG 32
Intra-Node Parallelism

1 Inter-Node Parallelism

2 Intra-Node Parallelism

SATG 34
Intra-Node Parallelism
▪ Dual-socket system – the most common datacenter node
▪ Cache-Coherent Non-Uniform Memory Access (ccNUMA) system
▪ Shared-memory based parallel programming – addressed by multiple processes (e.g. MPI) and/or multiple threads per process (e.g. OpenMP)
▪ Possibility of accelerators connected via PCIe or similar (e.g. CXL)

SATG 35
Intra-Node Parallelism

What can I do?
▪ Tune for best memory locality (NUMA optimization) – see the sketch below
▪ Employ proper process pinning
▪ Avoid frequent involvement of the Operating System (OS) kernel, e.g. by using scalable memory allocators

Which Tools?
▪ Identify issues with the VTune™ Profiler
  • Hotspots
  • Memory Access
  • HPC Performance Characterization
▪ Identify issues with the VTune™ Profiler Application Performance Snapshot
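A minimal OpenMP sketch of the first-touch idea behind NUMA-aware memory locality, as referenced above; the array size and the suggested pinning environment (e.g. OMP_PLACES=cores, OMP_PROC_BIND=close) are illustrative assumptions:

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1u << 26;                        // illustrative array size
    std::vector<double> a(n);

    // First touch: initialize in parallel with a static schedule so each page
    // is faulted in by the thread (and therefore the socket) that will use it.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i) a[i] = 0.0;

    // The compute loop reuses the same static schedule, so each thread mostly
    // reads memory attached to its own NUMA node.
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i) sum += a[i] + 1.0;

    std::printf("sum = %.1f using %d threads\n", sum, omp_get_max_threads());
    return 0;
}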

SATG 36
Core / Thread Level Parallelism

1 Inter-Node Parallelism

2 Intra-Node Parallelism

3 Core / Thread Level Parallelism

SATG 37
Core / Thread Level Parallelism

▪ Multi-core CPU with tens of cores interconnected by a ring or mesh
▪ Some levels of private cache (L1/L2) and shared cache (L3)
▪ Dynamically scaling core frequencies
▪ Shared-memory based parallel programming

SATG 38
Core / Thread Level Parallelism

What can I do?
▪ Tune for best memory access (cache blocking, non-temporal stores, …) – see the sketch below
▪ Load-balance threading
▪ Utilize performance-optimized libraries wherever possible

Which Tools?
▪ Identify issues with the VTune™ Profiler
  • Hotspots
▪ Math Kernel Library
▪ Threading Building Blocks
▪ Intel C++ / Fortran Compiler – OpenMP Runtime
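A minimal sketch of the cache-blocking item marked above, combined with OpenMP threading, on a matrix transpose; the tile size of 64 is an assumed tuning parameter:

#include <omp.h>
#include <algorithm>
#include <vector>

// Walk the matrix in BLOCK x BLOCK tiles so each tile's working set fits in
// the private caches, then distribute the tiles across threads.
void transpose_blocked(const std::vector<double>& in, std::vector<double>& out, int n) {
    const int BLOCK = 64;                                  // assumed tile size, tune per cache
    #pragma omp parallel for collapse(2) schedule(static)
    for (int ib = 0; ib < n; ib += BLOCK)
        for (int jb = 0; jb < n; jb += BLOCK)
            for (int i = ib; i < std::min(ib + BLOCK, n); ++i)
                for (int j = jb; j < std::min(jb + BLOCK, n); ++j)
                    out[(std::size_t)j * n + i] = in[(std::size_t)i * n + j];
}

int main() {
    const int n = 2048;
    std::vector<double> a((std::size_t)n * n, 1.0), b((std::size_t)n * n);
    transpose_blocked(a, b, n);
    return 0;
}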

SATG 39
Thread-Level Parallelism with Simultaneous
MultiThreading (SMT)
1 Inter- Node Parallelism

2 Intra- Node Parallelism

3 Core / Thread Level Parallelism

4 Thread-Level Parallelism with SMT

SATG 40
Thread-Level Parallelism with Simultaneous MultiThreading (SMT)
▪ x86 ISA microprocessor SMT core
▪ Partitioned front end with multiple buffers keeping the architectural state
▪ Shared back end with execution blocks
▪ SMT drives the Instruction Level Parallelism (ILP) increase, while complementing out-of-order execution
▪ Implicit programming model – threading
SATG 41
SMT, also known as Hyperthreading

[Diagram: with SMT, two architectural states (A and B) share the core's scheduler, execution ports (INT and VEC ALUs, shift, LEA, JMP, MUL, FMA, DIV, load/store units) and memory control over time (processor cycles); without SMT only a single architectural state issues work]
SATG 42
Thread-Level Parallelism with Hyperthreading (SMT)

What can I do?
▪ Enable SMT in the BIOS
▪ Use pinning strategies to leverage / not leverage SMT (see the sketch below)
  • OMP_PLACES / KMP_AFFINITY
  • I_MPI_PIN_CELL etc.
▪ Effective SMT usage requires instruction diversity between threads

Which Tools?
▪ Identify issues with the VTune™ Profiler
  • Microarchitecture Analysis
▪ MKL BLAS3 routines leverage SMT
▪ Use the Pinning Simulator for the Intel® MPI Library
▪ Use the Intel Compiler OpenMP Runtime environment variables / APIs for pinning
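A minimal, Linux-only sketch (it assumes glibc's sched_getcpu() is available) that reports where each OpenMP thread actually runs, which is a quick way to check whether an OMP_PLACES / KMP_AFFINITY pinning scheme uses or avoids the SMT siblings as intended:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for sched_getcpu() on glibc
#endif
#include <omp.h>
#include <sched.h>
#include <cstdio>

int main() {
    // Run e.g. with OMP_PLACES=cores OMP_PROC_BIND=close (illustrative settings)
    // and compare the reported logical CPUs against the core/SMT topology.
    #pragma omp parallel
    {
        #pragma omp critical
        std::printf("OpenMP thread %2d runs on logical CPU %d\n",
                    omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}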
SATG 43
Instruction-level Parallelism (ILP)

1 Inter-Node Parallelism

2 Intra-Node Parallelism

3 Core / Thread Level Parallelism

4 Thread-Level Parallelism with SMT

5 Instruction-level Parallelism

SATG 44
Instruction-level Parallelism (ILP)
▪ x86 ISA microprocessor SMT core
▪ Complex Instruction Set Computer (CISC) front end with Reduced Instruction Set Computer (RISC) back end
▪ Superscalar out-of-order back end – increasing ILP (speculative execution)
▪ Von Neumann architecture (L1+) and Harvard architecture (L1) core
▪ Pipelining (parallelism over time)
  • Through front-end & back-end stages
  • Through execution units (ALUs)

SATG 45
Instruction-level Parallelism (ILP)

[Diagram: core microarchitecture – front end with 32KB L1 instruction cache, pre-decode, instruction queue, decoders, µop cache, µop queue and branch prediction unit; in-order allocate/rename/retire with load, store and reorder buffers; out-of-order scheduler issuing to execution ports (INT and VEC ALUs, shift, LEA, JMP, MUL, FMA, DIV, load/store address and data units, memory control); 32KB L1 data cache, 1MB L2 cache and fill buffers]

SATG 46
Instruction-level Parallelism (ILP)

What can I do?
▪ ILP is mostly handled by the compiler – use the right compiler
▪ Effective ILP usage requires instruction diversity wherever possible
▪ Some instructions stall longer than others – e.g. try to replace multiple divisions by a reciprocal (see the sketch below)

Which Tools?
▪ Identify issues with the VTune™ Profiler
  • Microarchitecture Analysis
▪ Intel® DPC++/C++ Compiler, Intel® C++ Compiler Classic, Intel® Fortran Compiler Classic, & Intel® Fortran Compiler (Beta)
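A minimal sketch of the division-versus-reciprocal hint above; the reciprocal variant can differ in the last bits from the division variant, which is why compilers usually apply this transformation on their own only under relaxed floating-point settings:

#include <vector>

// Slow variant: one divide per element keeps the (long-latency) divider busy.
void scale_div(std::vector<float>& x, float d) {
    for (float& v : x) v = v / d;
}

// Faster variant: one divide up front, then independent multiplies that the
// out-of-order core can overlap and the compiler can vectorize.
void scale_recip(std::vector<float>& x, float d) {
    const float r = 1.0f / d;
    for (float& v : x) v = v * r;
}

int main() {
    std::vector<float> x(1 << 20, 3.0f);
    scale_div(x, 7.0f);
    scale_recip(x, 7.0f);
    return 0;
}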

SATG 47
Data Level Parallelism

1 Inter-Node Parallelism

2 Intra-Node Parallelism

3 Core / Thread Level Parallelism

4 Thread-Level Parallelism with SMT

5 Instruction-level Parallelism

6 Data Level Parallelism

SATG 48
Data Level Parallelism

▪ x86 ISA extensions
▪ Flynn's taxonomy: Single Instruction Multiple Data (SIMD)
▪ Code candidates: loops – split the iterations into chunks of SIMD width, reducing the trip count
▪ Vectorization can happen implicitly, with the compiler generating the instructions

[Diagram: AVX-512 fused multiply-add (FMA) vector instruction, D = (A + C) x B, operating on eight FP64 lanes at once]

SATG 49
Data Level Parallelism

Evolution of the x86 SIMD instruction sets:
▪ MMX (1996): 64-bit vectors, 57 instructions, integer types only, 8 registers
▪ SSE (1999): 128-bit vectors, 70 instructions, single-precision vectors, streaming operations
▪ SSE2 (2001): 144 instructions, double-precision vectors, 8/16/32/64-bit integer vectors
▪ SSE3 (2004): 13 instructions, complex data
▪ SSSE3 (2006): 32 instructions, decode
▪ SSE4.1 (2007): 47 instructions, video and graphics building blocks, advanced vector instructions
▪ SSE4.2 (2008): 8 instructions, string/XML processing, POP-count, CRC
▪ AES-NI (2009): 7 instructions, encryption, decryption, key generation
▪ AVX (2011): 256-bit vectors, ~100 new instructions, 3- and 4-operand instructions
▪ AVX2 (2012/2013): integer AVX expands to 256 bit, improved bit manipulation, FMA, vector shifts, gather
▪ AVX-512: 512-bit vectors, ~300 legacy SSE instructions updated to 512 bit, 16 new 512-bit registers, 8 opmask registers

One vector add processes several elements per instruction, e.g. a[3..0] + b[3..0] = c[3..0] for the loop:

for (i = 0; i < MAX; i++)
    c[i] = a[i] + b[i];

SATG 50

Data Level Parallelism

From ease of use to full programmer control:

▪ Performance libraries (e.g. IPP and MKL) – ease of use
▪ Compiler: fully automatic vectorization
▪ Compiler: auto-vectorization hints (#pragma ivdep, …)
▪ Explicit vectorization with OpenMP* 4.0 and later (SIMD directive)
▪ SIMD intrinsic class (F32vec4 add)
▪ Vector intrinsics (_mm_add_ps())
▪ Assembler code (addps) – programmer control
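A minimal sketch contrasting three rungs of this ladder on the same loop: an auto-vectorization hint (#pragma ivdep is Intel-compiler specific), the OpenMP SIMD directive, and SSE intrinsics; the function names are illustrative:

#include <xmmintrin.h>   // _mm_add_ps() and friends (SSE)
#include <vector>

// 1) Auto-vectorization hint: assert that the pointers do not alias.
void add_hint(float* __restrict c, const float* a, const float* b, int n) {
#pragma ivdep
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// 2) Explicit vectorization: OpenMP 4.0 SIMD directive.
void add_simd(float* c, const float* a, const float* b, int n) {
#pragma omp simd
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// 3) Vector intrinsics: full programmer control, four floats per SSE register.
void add_intrin(float* c, const float* a, const float* b, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(c + i, _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    for (; i < n; ++i) c[i] = a[i] + b[i];                 // scalar remainder
}

int main() {
    const int n = 1000;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
    add_hint(c.data(), a.data(), b.data(), n);
    add_simd(c.data(), a.data(), b.data(), n);
    add_intrin(c.data(), a.data(), b.data(), n);
    return 0;
}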

SATG 51
Data Level Parallelism

What can I do?
▪ Use the most advanced vector ISA available on your machine – the compiler default is SSE2
▪ Add code pragmas to the non-vectorized loops in order to feed the compiler with domain knowledge
▪ Re-arrange code to achieve
  • loops with a trip count known at runtime
  • branch-free loops to increase efficiency
▪ Tell the compiler to use the full 512-bit vector width in case of compute-heavy loops

Which Tools?
▪ Intel® DPC++/C++ Compiler, Intel® C++ Compiler Classic, Intel® Fortran Compiler Classic, & Intel® Fortran Compiler (Beta)
▪ Use the Intel® Advisor to
  • Identify vector code candidates
  • Determine vector efficiencies
  • Determine code efficiencies with the Roofline Analysis
  • Get a step-by-step guide towards efficient vector code

Programming Guidelines for Vectorization


SATG 52
Data Level Parallelism

[Screenshots: Intel Advisor vectorized-loop summary (top left) and Advisor Roofline chart (bottom right)]
SATG 53
Data Level Parallelism – legacy instructions
The x87 Coprocessor

What is / was that?
▪ FP extension of the 8086 ISA – later (80486) integrated on the die
▪ Today, x87 instructions are rarely used; if they are, numerical results might differ (80-bit intermediates)!
▪ Avoid these instructions in favor of reproducible results for both scalar & vector code

SATG 54
Accelerator / Offloading Parallelism

1 Inter-Node Parallelism

2 Intra-Node Parallelism

3 Core / Thread Level Parallelism

4 Thread-Level Parallelism with SMT

5 Instruction-level Parallelism

6 Data Level Parallelism

7 Accelerator / Offloading Parallelism

SATG 58
Accelerator / Offloading Parallelism

▪ GPGPU – denser compute
▪ Slice -> SubSlice -> Execution Units (EUs)
▪ Each EU has
  • 8 threads (similar to SMT)
  • a 28KB register file
  • 2x 4-wide SIMD units
▪ Offload-based programming enabled by oneAPI
SATG 59
Accelerator / Offloading Parallelism
Data Parallel C++ (DPC++) - Hello World

▪ Intel® oneAPI Accelerator Programming
  • C/C++: Intel® DPC++ Compiler, or Intel® C++ Compiler with OpenMP Offloading
  • Fortran: Intel® Fortran Compiler with OpenMP Offloading

#include <CL/sycl.hpp>
#include <iostream>
#include <vector>
using namespace sycl;

int main() {
  std::vector<float> A(1024), B(1024), C(1024);
  // some data initialization
  {
    buffer bufA{A}, bufB{B}, bufC{C};
    queue q;
    q.submit([&](handler &h) {
      auto A = bufA.get_access(h, read_only);
      auto B = bufB.get_access(h, read_only);
      auto C = bufC.get_access(h, write_only);
      h.parallel_for(1024, [=](auto i) {
        C[i] = A[i] + B[i];
      });
    });
  }
  for (int i = 0; i < 1024; i++)
    std::cout << "C[" << i << "] = " << C[i] << std::endl;
}

SATG 60
Accelerator / Offloading Parallelism

What can I do?
▪ Enable your code today for Intel integrated graphics GPUs and upcoming datacenter GPGPUs
▪ Reduce data transfers between host and device (see the sketch below)
▪ Minimize communication / synchronization between individual threads

Which Tools? – Intel® oneAPI
▪ Intel® DPC++/C++ Compiler & Intel® Fortran Compiler (Beta)
▪ Intel® oneMKL
▪ Use the Intel® Advisor to
  • Find offloading candidates with the Offload Advisor
  • Estimate code speedup on the GPU with the Offload Advisor
▪ Use Intel® VTune™ Profiler to
  • Analyze the offload performance
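The DPC++ example on the previous slide shows the SYCL route; the sketch below shows the OpenMP offload route for the same vector add. An offload-capable device and the exact compile flags (e.g. -fiopenmp -fopenmp-targets=spir64 for the Intel compilers) are assumptions of this sketch:

#include <cstdio>
#include <vector>

int main() {
    const int n = 1024;
    std::vector<float> A(n, 1.0f), B(n, 2.0f), C(n, 0.0f);
    float *a = A.data(), *b = B.data(), *c = C.data();

    // Map the inputs to the device once, run the loop there, map the result back.
    // Keeping data resident on the device across kernels reduces transfers further.
    #pragma omp target teams distribute parallel for map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    std::printf("C[0] = %f, C[%d] = %f\n", C[0], n - 1, C[n - 1]);
    return 0;
}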

SATG 61
Accelerator / Offloading Parallelism

[Screenshots: Offload Advisor efficiency report and VTune GPU Offload Analysis]

SATG 62
The Seven Levels of Parallelism

1 Inter-Node Parallelism

2 Intra-Node Parallelism

3 Core / Thread Level Parallelism

4 Thread-Level Parallelism with SMT

5 Instruction-level Parallelism

6 Data Level Parallelism

7 Accelerator / Offloading Parallelism

SATG 69
The Seven Levels of Parallelism
Parallelism on the Intel Architecture Supported by the Intel® oneAPI Base & HPC Toolkits

Levels: Inter-Node Parallelism, Intra-Node Parallelism, Core / Thread Level Parallelism, Thread-Level Parallelism with SMT, Instruction-level Parallelism, Data Level Parallelism, Accelerator / Offloading Parallelism

Intel oneAPI Tools for HPC (Intel® oneAPI HPC Toolkit + Intel® oneAPI Base Toolkit)

Direct Programming:
▪ Intel® C++ Compiler Classic
▪ Intel® Fortran Compiler Classic
▪ Intel® Fortran Compiler (Beta)
▪ Intel® oneAPI DPC++/C++ Compiler
▪ Intel® DPC++ Compatibility Tool
▪ Intel® Distribution for Python*
▪ Intel® FPGA Add-on for oneAPI Base Toolkit

API-Based Programming:
▪ Intel® MPI Library
▪ Intel® oneAPI DPC++ Library
▪ Intel® oneAPI Math Kernel Library
▪ Intel® oneAPI Data Analytics Library
▪ Intel® oneAPI Threading Building Blocks
▪ Intel® oneAPI Video Processing Library
▪ Intel® oneAPI Collective Communications Library
▪ Intel® oneAPI Deep Neural Network Library
▪ Intel® Integrated Performance Primitives

Analysis Tools:
▪ Intel® Inspector
▪ Intel® Trace Analyzer & Collector
▪ Intel® Cluster Checker
▪ Intel® VTune™ Profiler
▪ Intel® Advisor
▪ Intel® Distribution for GDB*

SATG 70
Performance Analysis Tools for Diagnosis
Intel® Parallel Studio

[Decision flow, starting from the Application Performance Snapshot:
▪ Cluster scalable? No: tune MPI with the Intel® Trace Analyzer & Collector and the Intel® MPI Tuner.
▪ Yes: use Intel® VTune™ Amplifier to check whether the code is memory-bandwidth sensitive.
  • Yes: optimize bandwidth with the Intel® VTune™ Profiler.
  • No: check whether the threading is effective. If not, improve threading with the Intel® VTune™ Profiler; if yes, vectorize with the Intel® Advisor.]
SATG 71

Summary
Code modernization is not always 'easy & quick' (a.k.a. "rewrite").
The common methodology is to (1) analyze, then (2) optimize.

Two themes consistently re-occur: task and data parallelism.

To achieve the best CPU performance, make sure your programs are efficiently vectorized and well parallelized.
The latter in particular also applies to GPUs.

SATG 72
(Recorded) Tools Webinars
▪ Intel® oneAPI technical webinar (2 days) – co-hosted with Zuse Institute Berlin
  ▪ oneAPI webinar on March 2nd and 3rd, 2021
  ▪ https://www.zib.de/workshops/2021/oneapi
  ▪ https://www.hlrn.de/doc/display/PUB/Joint+NHR@ZIB+-+INTEL+++oneAPI+Workshop
▪ Intel® oneAPI Rendering Toolkit webinar (1 day):
  ▪ Rendering, 08.06.2021: https://www.ai-spektrum.de/veranstaltungen/intel-oneapi-rendering-toolkit-workshop.html
  ▪ Make use of simulated data and visualize it in a high-fidelity, photorealistic fashion
▪ Intel® oneAPI AI Analytics Toolkit webinar (1 day):
  ▪ AI webinar on 15.06.2021: https://www.ai-spektrum.de/veranstaltungen/artificial-intelligence-on-intelr-platforms-using-intelr-oneapi-ai-analytics-toolkit-openvino-workshop.html
  ▪ Intel performance-optimized AI toolkits for classic and machine learning

SATG Copyright ©, Intel Corporation. All rights reserved. 73


*Other names and brands may be claimed as the property of others.
Selected Links: Intel Tools User's Manuals
• Intel® C++ Compiler Classic Developer Guide and Reference
https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top.html

• Intel Data Parallel C++


https://link.springer.com/content/pdf/10.1007%2F978-1-4842-5574-2.pdf

• Intel VTune Cookbook


This cookbook introduces methodologies and use-case recipes to analyze the performance of your code with VTune Profiler, a tool that helps you identify ineffective algorithm and hardware usage and provides tuning advice.
https://www.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top.html

• Intel® Advisor Cookbook


The Intel® Advisor Cookbook contains step-by-step instructions to help effectively use more cores, vectorization, or heterogeneous
processing using Intel Advisor.
https://www.intel.com/content/www/us/en/develop/documentation/advisor-cookbook/top.html

• Intel Inspector User Guide(s) for Linux and for Windows


Get Started with Intel® Inspector (Windows* OS). This document explains how to get started with the Intel® Inspector workflows.
Linux:
https://www.intel.com/content/www/us/en/develop/documentation/inspector-user-guide-linux/top.html

▪ Windows :
https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-inspector/top/windows.html

• Microsoft Visual Studio* Integration of Intel VTune


https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/launch/microsoft-visual-studio-integration.html

SATG Copyright ©, Intel Corporation. All rights reserved. 74


*Other names and brands may be claimed as the property of others.
Other useful Links
• Intel Software Developer Manuals
These manuals describe the architecture and programming environment of the Intel® 64 and IA-32 architectures.

Intel® 64 and IA-32 Architectures Software Developer Manuals

• Intel Software Development Tools Magazine ‘Intel Parallel Universe’

https://www.intel.com/content/www/us/en/developer/community/parallel-universe-magazine/overview.html?s=Newest

• Intel Developer Zone

https://www.intel.com/content/www/us/en/developer/overview.html

SATG Copyright ©, Intel Corporation. All rights reserved. 75


*Other names and brands may be claimed as the property of others.
QUESTIONS?

SATG 76