
Performance Optimization for Software Developers
with oneAPI & Parallelization on Intel® Architecture

Edmund Preiss
Business Development Manager
Software and Advanced Technology Group (SATG)
NOTICES AND DISCLAIMERS

Refer to https://software.intel.com/en-us/articles/optimization-notice for more information regarding performance and optimization choices in Intel
software products.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance
varies depending on system configuration. No product or component can be absolutely secure. Check with your system manufacturer or retailer or learn
more at [intel.com].

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and
roadmaps.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications.
Current characterized errata are available on request.
Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.
Other names and brands may be claimed as the property of others.
© Intel Corporation

SATG 2
Welcome to the Parallelism Era!

Source: © 2014, James Reinders, Intel, used with permission

[Figure: Microprocessor frequency over time (history) versus power consumption]

… performance is no longer only the computer architect's job
… performance increase is increasingly the job of the software developer

SATG 4
Amdahl’s Law
“The speedup of a program
using multiple processors
in parallel computing is limited
by the sequential fraction of
the program.”

- Gene Amdahl

Speedup = (s + p) / (s + p/N) = 1 / (s + p/N)

where s is the sequential fraction, p = 1 - s the parallel fraction, and N the number of processors.
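A quick way to build intuition for the formula is to evaluate it for a few processor counts. The sketch below does that in C++ for an assumed serial fraction of s = 0.1 (an illustrative value, not taken from the slide):

#include <cstdio>

// Amdahl's Law: with serial fraction s and parallel fraction p = 1 - s,
// the speedup on N processors is 1 / (s + p / N).
double amdahl_speedup(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main() {
    const double s = 0.1;  // assumed: 10% of the runtime stays sequential
    const int counts[] = {1, 2, 4, 8, 16, 64, 1024};
    for (int n : counts)
        std::printf("N = %4d  ->  speedup = %5.2fx\n", n, amdahl_speedup(s, n));
    // Even with 1024 processors the speedup stays below the 1/s = 10x limit.
    return 0;
}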

SATG 5
The ‘Free Lunch’ is over

“Parallelism ==> Performance”

(which leads to)

Optimization – it is up to the developer to make sure the statement above becomes reality!

SATG 6
Parallelism on the Intel Architecture

Developer

Tools

Architecture

SATG 7
Code Optimization Principles
Stage 1: Use Optimized Libraries

Stage 2: Compile with Architecture-specific Optimizations

Stage 3: Analysis and Tuning

Stage 4: Check Correctness

SATG 8
oneAPI – A Tools Development Framework

▪ Open Industry Specification
▪ Intel Product

SATG 9
Programming Challenges for Multiple Architectures

▪ Growth in specialized workloads
▪ Variety of data-centric hardware required
▪ Requires separate programming models and toolchains for each architecture
▪ Software development complexity limits freedom of architectural choice

[Diagram: application workloads need diverse hardware – scalar, vector, spatial, matrix; middleware & frameworks sit on top of CPU, GPU, FPGA and other accelerators (XPUs), each with its own programming model]

SATG 10
Introducing oneAPI

▪ Cross-architecture programming that delivers freedom to choose the best hardware
▪ Based on industry standards and open specifications
▪ Exposes cutting-edge performance features of the latest hardware
▪ Compatible with existing high-performance languages and programming models, including C++, OpenMP, Fortran, and MPI

[Diagram: application workloads need diverse hardware – scalar, vector, spatial, matrix; middleware & frameworks on top of the oneAPI industry initiative and Intel product, spanning CPU, GPU, FPGA and other accelerators (XPUs)]

SATG 11
oneAPI Industry Initiative
Break the Chains of Proprietary Lock-in

▪ A cross-architecture language based on C++ and SYCL standards
▪ Powerful libraries designed for acceleration of domain-specific functions
▪ Low-level hardware abstraction layer
▪ Open to promote community and industry collaboration
▪ Enables code reuse across architectures and vendors

The productive, smart path to freedom for accelerated computing from the economic and technical burdens of proprietary programming models.

[Diagram: oneAPI Industry Specification – direct programming (Data Parallel C++) and API-based programming libraries (math, threading, analytics/ML, DNN, ML comm, video processing) on a low-level hardware interface spanning CPU, GPU, FPGA and other accelerators (XPUs)]

Visit oneapi.com for more details

SATG 12
Data Parallel C++
Standards-based, Cross-architecture Language
DPC++ = ISO C++ and Khronos SYCL

Parallelism, productivity, and performance for CPUs and accelerators
▪ Delivers accelerated computing by exposing hardware features
▪ Allows code reuse across hardware targets, while permitting custom tuning for specific accelerators
▪ Provides an open, cross-industry solution to single-architecture proprietary lock-in

Based on C++ and SYCL
▪ Delivers C++ productivity benefits, using common, familiar C and C++ constructs
▪ Incorporates SYCL from the Khronos Group to support data parallelism and heterogeneous programming

Community project to drive language enhancements
▪ Provides extensions to simplify data parallel programming
▪ Continues evolution through open and cooperative development

Apply your skills to the next innovation, not to rewriting software for the next hardware platform.

[Diagram: Direct Programming – Data Parallel C++ = ISO C++ + Khronos SYCL + community extensions]

The open source and Intel DPC++/C++ compilers support Intel CPUs, GPUs, and FPGAs.
Codeplay announced a DPC++ compiler that targets Nvidia GPUs.

SATG 13
Powerful oneAPI Libraries

▪ Designed for acceleration of key domain-specific functions
▪ Pre-optimized for each target platform for maximum performance

Intel® oneAPI Math Kernel Library (oneMKL)
Intel® oneAPI Deep Neural Network Library (oneDNN)
Intel® oneAPI Video Processing Library (oneVPL)
Intel® oneAPI Data Analytics Library (oneDAL)
Intel® oneAPI Threading Building Blocks (oneTBB)
Intel® oneAPI Collective Communications Library (oneCCL)
Intel® oneAPI DPC++ Library (oneDPL)

SATG 14
oneAPI – A Tools Development Framework

▪ Open Industry Specification
▪ Intel Product

SATG 15
Intel® oneAPI Product
Built on a Rich Heritage of Tools Based on Intel® Xeon® Processors, Now Expanded to XPUs

A complete set of advanced compilers, libraries, and porting, analysis & debugger tools

▪ Accelerates compute by exploiting cutting-edge hardware features
▪ Interoperable with existing programming models and code bases (C++, Fortran, Python, OpenMP, etc.), so developers can be confident that existing applications work seamlessly with oneAPI
▪ Eases transitions to new systems and accelerators: using a single code base frees developers to invest more time on innovation

Available Now

[Diagram: Intel® oneAPI product – compatibility tool, languages, libraries, and analysis & debug tools on a low-level hardware interface spanning CPU, GPU, FPGA (XPUs)]

Visit software.intel.com/oneapi for more details

Some capabilities may differ per architecture and custom tuning will still be required. Other accelerators to be supported in the future.

SATG 16
Intel Compiler Transition: Classic to LLVM

Start Your Migration Now
▪ Expand to XPUs
▪ Modern LLVM infrastructure

Intel® C++ Compiler Classic (CPU only) -> Intel® oneAPI DPC++/C++ Compiler (CPU, GPU, FPGA)
▪ Use for all new projects
▪ Migrate legacy projects

Intel® Fortran Compiler Classic (CPU only) -> Intel® Fortran Compiler (Beta) (CPU, GPU)
▪ Test drive now & provide feedback
▪ Prepare for migration

[Diagram: features & performance of the Classic compilers converge into the LLVM-based compilers over time; today, performance, quality, and support continue for both]


SATG 17
Intel® oneAPI Toolkits
A Complete Set of Proven Developer Tools Expanded from CPU to XPU

A core set of high-performance tools for building C++, Data Parallel C++ applications & oneAPI library-based applications

Native code developers:
▪ Intel® oneAPI Tools for HPC – deliver fast Fortran, OpenMP & MPI applications that scale
▪ Intel® oneAPI Tools for IoT – build efficient, reliable solutions that run at the network's edge
▪ Intel® oneAPI Rendering Toolkit – create performant, high-fidelity visualization applications

Data scientists & AI developers (specialized workloads):
▪ Intel® AI Analytics Toolkit – accelerate machine learning & data science pipelines with optimized DL frameworks & high-performing Python libraries
▪ Intel® Distribution of OpenVINO™ Toolkit – deploy high-performance inference & applications from edge to cloud

SATG 18
Streamlining Product Line

▪ Composer Edition & Professional Edition (CPU focused: Core & Xeon), single-node -> Intel® oneAPI Base & HPC Toolkit (single-node)
▪ Cluster Edition, multi-node -> Intel® oneAPI Base & HPC Toolkit (multi-node)
▪ Hardware targets expand from CPU (Core & Xeon) to CPU (Core & Xeon), GPU (Intel), and FPGA (Intel)

SATG 19
Intel® oneAPI Base & HPC Toolkit
Intel® oneAPI Tools for HPC: Deliver Fast Applications that Scale

What is it?
A toolkit that adds to the Intel® oneAPI Base Toolkit for building high-performance, scalable parallel code in C++, Fortran, OpenMP & MPI, from enterprise to cloud, and HPC to AI applications.

Who needs this product?
▪ OEMs/ISVs
▪ C++, Fortran, OpenMP, MPI developers

Why is this important?
▪ Accelerate performance on Intel® Xeon® & Core™ processors and accelerators
▪ Deliver fast, scalable, reliable parallel code with less effort; built on industry standards

Intel® oneAPI HPC Toolkit + Intel® oneAPI Base Toolkit

Direct Programming:
▪ Intel® C++ Compiler Classic
▪ Intel® Fortran Compiler Classic
▪ Intel® Fortran Compiler (Beta)
▪ Intel® oneAPI DPC++/C++ Compiler
▪ Intel® DPC++ Compatibility Tool
▪ Intel® Distribution for Python*
▪ Intel® FPGA Add-on for oneAPI Base Toolkit

API-Based Programming:
▪ Intel® MPI Library
▪ Intel® oneAPI DPC++ Library
▪ Intel® oneAPI Math Kernel Library
▪ Intel® oneAPI Data Analytics Library
▪ Intel® oneAPI Threading Building Blocks
▪ Intel® oneAPI Video Processing Library
▪ Intel® oneAPI Collective Communications Library
▪ Intel® oneAPI Deep Neural Network Library
▪ Intel® Integrated Performance Primitives

Analysis & Debug Tools:
▪ Intel® Inspector
▪ Intel® Trace Analyzer & Collector
▪ Intel® Cluster Checker
▪ Intel® VTune™ Profiler
▪ Intel® Advisor
▪ Intel® Distribution for GDB*

Learn More: intel.com/oneAPI-HPCKit

SATG 22
Intel® oneAPI Rendering Toolkit
Render Your Vision in Highest Fidelity

Powerful Libraries for High-Fidelity Visualization Applications
▪ Deliver high-performance, high-fidelity visualization applications on Intel® architecture
▪ Create amazing visual, hyper-realistic renderings via ray tracing with global illumination
▪ Access all system memory space to create renderings using the largest data sets
▪ Flexible, cost-efficient development using open source libraries

Intel oneAPI Rendering & Ray Tracing Libraries
▪ Intel® Embree – high-performance, feature-rich ray tracing & photorealistic rendering 1
▪ Intel® Open Image Denoise – AI-accelerated denoiser for superior visual quality 2
▪ Intel® OpenSWR – high-performance, scalable, OpenGL*-compatible rasterizer 3
▪ Intel® Open Volume Kernel Library – render & simulate 3D spatial data processing 4
▪ Intel® OSPRay – scalable, portable, distributed rendering API
▪ Intel® OSPRay Studio – real-time rendering through a graphical user interface with this new scene graph application
▪ Intel® OSPRay for Hydra – connect the Rendering Toolkit libraries to the Universal Scene Description Hydra rendering subsystem via a plugin 5

Intel® Embree, part of the Intel® oneAPI Rendering Toolkit, won an Academy Award® Technical Achievement Award in 2021.

1 Avengers: Infinity War – Digital Domain, Marvel Studios, Chaos Group V-Ray
2 Scene courtesy of Frank Meinl
3 Model from Leigh Orf at University of Wisconsin. For more tornado visualization, visit Leigh Orf's site
4 Smoke volume, data courtesy OpenVDB example repository
5 Moana Island Scene, Walt Disney Animation Studios, publicly available dataset: 15 fps+, ~160 billion prims

Learn More: intel.com/oneAPI-RenderKit

SATG 23
Intel® oneAPI AI Analytics Toolkit

Accelerate end-to-end AI and data analytics pipelines with libraries optimized for Intel® architectures

Who Uses It?
Data scientists, AI researchers, ML and DL developers, AI application developers

Top Features/Benefits
▪ Deep learning performance for training and inference with Intel-optimized DL frameworks and tools
▪ Drop-in acceleration for data analytics and machine learning workflows with compute-intensive Python packages

Deep Learning
▪ Intel® Optimization for TensorFlow
▪ Intel® Optimization for PyTorch
▪ Intel® Low Precision Optimization Tool
▪ Model Zoo for Intel® Architecture

Data Analytics & Machine Learning
▪ Accelerated data frames: Intel® Distribution of Modin, OmniSci backend
▪ Intel® Distribution for Python: XGBoost, Scikit-learn, Daal-4Py, NumPy, SciPy, Pandas
▪ Samples and end-to-end workloads

Supported hardware architectures1: CPU, GPU
1 Hardware support varies by individual tool. Architecture support will be expanded over time. Other names and brands may be claimed as the property of others.

Get the toolkit HERE or via these locations: Intel Installer, Docker, Apt, Yum, Conda, Intel® DevCloud

Learn More: software.intel.com/oneapi/ai-kit

SATG 24
Intel® Distribution of OpenVINO™ Toolkit
Powered by oneAPI

Deliver High-Performance Deep Learning Inference
A toolkit to accelerate development of high-performance deep learning inference & computer vision in vision/AI applications used from edge to cloud. It enables deep learning on hardware accelerators & easy deployment across Intel® CPUs, GPUs, FPGAs, and VPUs.

Who needs this product?
▪ Computer vision and deep learning software developers
▪ Data scientists
▪ OEMs, ISVs, system integrators

Usages
Security surveillance, robotics, retail, healthcare, AI, office automation, transportation, non-vision use cases (speech, NLP, audio, text) & more

[Diagram: Deep Learning – Intel® Deep Learning Deployment Toolkit (Model Optimizer converts & optimizes models to an Intermediate Representation (IR); Inference Engine runs optimized inference on Intel® CPUs & GPU/Intel® Processor Graphics); Open Model Zoo (40+ pretrained models, model downloader, samples); Deep Learning Workbench (calibration tool, model analyzer, benchmark app, accuracy checker, aux. capabilities); Intel® FPGA runtime environment and bitstreams (Linux* only). Traditional Computer Vision – optimized libraries & code samples (OpenCV, OpenVX, samples); tools & libraries to increase media/video/graphics performance (Intel® Media SDK open source version, OpenCL™ drivers & runtimes for GPU/Intel® Processor Graphics)]

Edge AI & Vision Alliance

SATG 25
Parallelism on the Intel Architecture
Enabled by Intel® oneAPI Products

SATG 26
The Seven Levels of Parallelism

1 Inter-Node Parallelism

2 Intra-Node Parallelism

3 Core / Thread Level Parallelism

4 Thread-Level Parallelism with SMT

5 Instruction-level Parallelism

6 Data Level Parallelism

7 Accelerator / Offloading Parallelism

SATG 27
Inter-Node Parallelism

1 Inter-Node Parallelism

SATG 29
Inter-Node Parallelism

▪ Most common scenario for HPC / cluster applications
▪ Nodes connected by a fast interconnect called a fabric
▪ Partitioned Global Address Space (PGAS) & Distributed Memory Programming
▪ Message Passing Interface (MPI) – the most common approach

SATG 30
Inter-Node Parallelism

What can I do?
▪ Employ communication-avoiding algorithms (e.g. neighborhood collectives)
▪ Overlap compute and communication where possible (see the sketch below)
▪ Load balance work between ranks
▪ In case of a well-scaling application, add more nodes ☺

Which Tools?
▪ Identify scalability issues with the VTune™ Profiler Application Performance Snapshot & the Trace Analyzer and Collector
▪ Fine-tune collective operations with the Intel MPI autotuner
▪ Where applicable:
  • Math Kernel Library (oneMKL)
  • Data Analytics Library (oneDAL)
  • Collective Communications Library (oneCCL)
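A minimal sketch of the overlap idea marked above, using non-blocking MPI calls; the ring-shaped neighbor exchange and the buffer size are illustrative assumptions, not part of the original material:

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                                // illustrative message size
    std::vector<double> halo_send(n, rank), halo_recv(n), interior(n, 1.0);
    const int right = (rank + 1) % size;
    const int left  = (rank - 1 + size) % size;

    // Start the halo exchange without blocking ...
    MPI_Request reqs[2];
    MPI_Irecv(halo_recv.data(), n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(halo_send.data(), n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    // ... and do the work that does not depend on the halo while it is in flight.
    double local = 0.0;
    for (int i = 0; i < n; ++i) local += 2.0 * interior[i];

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);            // halo data is now usable
    // ... boundary work that needs halo_recv would go here ...

    MPI_Finalize();
    return 0;
}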

SATG 31
Inter-Node Parallelism

[Screenshots: Application Performance Snapshot HTML export (left) and VTune HPC Analysis (right)]

SATG 32
Intra-Node Parallelism

1 Inter-Node Parallelism

2 Intra-Node Parallelism

SATG 34
Intra-Node Parallelism
▪ Dual-socket system – the most common datacenter node
▪ Cache-Coherent Non-Uniform Memory Access (ccNUMA) system
▪ Shared-memory based parallel programming – addressed by multiple processes (e.g. MPI) and/or multiple threads per process (e.g. OpenMP)
▪ Possibility of accelerators connected via PCIe or similar (e.g. CXL)

SATG 35
Intra-Node Parallelism

What can I do?
▪ Tune for best memory locality (NUMA optimization) – see the sketch below
▪ Employ proper process pinning
▪ Avoid frequent involvement of the Operating System (OS) kernel, e.g. by using scalable memory allocators

Which Tools?
▪ Identify issues with the VTune™ Profiler
  • Hotspots
  • Memory Access
  • HPC Performance Characterization
▪ Identify issues with the VTune™ Profiler Application Performance Snapshot
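A minimal OpenMP sketch of the first-touch idea behind NUMA-aware memory locality, as referenced above; the array size and the suggested pinning environment (e.g. OMP_PLACES=cores, OMP_PROC_BIND=close) are illustrative assumptions:

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1u << 26;                        // illustrative array size
    std::vector<double> a(n);

    // First touch: initialize in parallel with a static schedule so each page
    // is faulted in by the thread (and therefore the socket) that will use it.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i) a[i] = 0.0;

    // The compute loop reuses the same static schedule, so each thread mostly
    // reads memory attached to its own NUMA node.
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i) sum += a[i] + 1.0;

    std::printf("sum = %.1f using %d threads\n", sum, omp_get_max_threads());
    return 0;
}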

SATG 36
Core / Thread Level Parallelism

1 Inter-Node Parallelism

2 Intra-Node Parallelism

3 Core / Thread Level Parallelism

SATG 37
Core / Thread Level Parallelism

▪ Multi-core CPU with tens of cores interconnected by a ring or mesh
▪ Some levels of private cache (L1/L2) and shared cache (L3)
▪ Dynamically scaling core frequencies
▪ Shared-memory based parallel programming

SATG 38
Core / Thread Level Parallelism

What can I do?
▪ Tune for best memory access (cache blocking, non-temporal stores, …) – see the sketch below
▪ Load-balance threading
▪ Utilize performance-optimized libraries wherever possible

Which Tools?
▪ Identify issues with the VTune™ Profiler
  • Hotspots
▪ Math Kernel Library
▪ Threading Building Blocks
▪ Intel C++ / Fortran Compiler – OpenMP Runtime
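A minimal sketch of the cache-blocking item marked above, combined with OpenMP threading, on a matrix transpose; the tile size of 64 is an assumed tuning parameter:

#include <omp.h>
#include <algorithm>
#include <vector>

// Walk the matrix in BLOCK x BLOCK tiles so each tile's working set fits in
// the private caches, then distribute the tiles across threads.
void transpose_blocked(const std::vector<double>& in, std::vector<double>& out, int n) {
    const int BLOCK = 64;                                  // assumed tile size, tune per cache
    #pragma omp parallel for collapse(2) schedule(static)
    for (int ib = 0; ib < n; ib += BLOCK)
        for (int jb = 0; jb < n; jb += BLOCK)
            for (int i = ib; i < std::min(ib + BLOCK, n); ++i)
                for (int j = jb; j < std::min(jb + BLOCK, n); ++j)
                    out[(std::size_t)j * n + i] = in[(std::size_t)i * n + j];
}

int main() {
    const int n = 2048;
    std::vector<double> a((std::size_t)n * n, 1.0), b((std::size_t)n * n);
    transpose_blocked(a, b, n);
    return 0;
}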

SATG 39
Thread-Level Parallelism with Simultaneous
MultiThreading (SMT)
1 Inter- Node Parallelism

2 Intra- Node Parallelism

3 Core / Thread Level Parallelism

4 Thread-Level Parallelism with SMT

SATG 40
Thread-Level Parallelism with Simultaneous MultiThreading (SMT)
▪ x86 ISA microprocessor SMT core
▪ Partitioned front end with multiple buffers keeping the architectural state
▪ Shared back end with execution blocks
▪ SMT drives the Instruction Level Parallelism (ILP) increase, while complementing out-of-order execution
▪ Implicit programming model – threading
SATG 41
SMT, also known as Hyperthreading

[Diagram: with SMT, two architectural states (A and B) share the core's scheduler, execution ports (INT and VEC ALUs, shift, LEA, JMP, MUL, FMA, DIV, load/store units) and memory control over time (processor cycles); without SMT only a single architectural state issues work]
SATG 42
Thread-Level Parallelism with Hyperthreading (SMT)

What can I do?
▪ Enable SMT in the BIOS
▪ Use pinning strategies to leverage / not leverage SMT (see the sketch below)
  • OMP_PLACES / KMP_AFFINITY
  • I_MPI_PIN_CELL etc.
▪ Effective SMT usage requires instruction diversity between threads

Which Tools?
▪ Identify issues with the VTune™ Profiler
  • Microarchitecture Analysis
▪ MKL BLAS3 routines leverage SMT
▪ Use the Pinning Simulator for the Intel® MPI Library
▪ Use the Intel Compiler OpenMP Runtime environment variables / APIs for pinning
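A minimal, Linux-only sketch (it assumes glibc's sched_getcpu() is available) that reports where each OpenMP thread actually runs, which is a quick way to check whether an OMP_PLACES / KMP_AFFINITY pinning scheme uses or avoids the SMT siblings as intended:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for sched_getcpu() on glibc
#endif
#include <omp.h>
#include <sched.h>
#include <cstdio>

int main() {
    // Run e.g. with OMP_PLACES=cores OMP_PROC_BIND=close (illustrative settings)
    // and compare the reported logical CPUs against the core/SMT topology.
    #pragma omp parallel
    {
        #pragma omp critical
        std::printf("OpenMP thread %2d runs on logical CPU %d\n",
                    omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}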
SATG 43
Instruction-level Parallelism (ILP)

1 Inter-Node Parallelism

2 Intra-Node Parallelism

3 Core / Thread Level Parallelism

4 Thread-Level Parallelism with SMT

5 Instruction-level Parallelism

SATG 44
Instruction-level Parallelism (ILP)
▪ x86 ISA microprocessor SMT core
▪ Complex Instruction Set Computer (CISC) front end with Reduced Instruction Set Computer (RISC) back end
▪ Superscalar out-of-order back end – increasing ILP (speculative execution)
▪ Von Neumann architecture (L1+) and Harvard architecture (L1) core
▪ Pipelining (parallelism over time)
  • Through front-end & back-end stages
  • Through execution units (ALUs)

SATG 45
Instruction-level Parallelism (ILP)

[Diagram: core microarchitecture – front end with 32KB L1 instruction cache, pre-decode, instruction queue, decoders, µop cache, µop queue and branch prediction unit; in-order allocate/rename/retire with load, store and reorder buffers; out-of-order scheduler issuing to execution ports (INT and VEC ALUs, shift, LEA, JMP, MUL, FMA, DIV, load/store address and data units, memory control); 32KB L1 data cache, 1MB L2 cache and fill buffers]

SATG 46
Instruction-level Parallelism (ILP)

What can I do?
▪ ILP is mostly handled by the compiler – use the right compiler
▪ Effective ILP usage requires instruction diversity wherever possible
▪ Some instructions stall longer than others – e.g. try to replace multiple divisions by a reciprocal (see the sketch below)

Which Tools?
▪ Identify issues with the VTune™ Profiler
  • Microarchitecture Analysis
▪ Intel® DPC++/C++ Compiler, Intel® C++ Compiler Classic, Intel® Fortran Compiler Classic, & Intel® Fortran Compiler (Beta)
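A minimal sketch of the division-versus-reciprocal hint above; the reciprocal variant can differ in the last bits from the division variant, which is why compilers usually apply this transformation on their own only under relaxed floating-point settings:

#include <vector>

// Slow variant: one divide per element keeps the (long-latency) divider busy.
void scale_div(std::vector<float>& x, float d) {
    for (float& v : x) v = v / d;
}

// Faster variant: one divide up front, then independent multiplies that the
// out-of-order core can overlap and the compiler can vectorize.
void scale_recip(std::vector<float>& x, float d) {
    const float r = 1.0f / d;
    for (float& v : x) v = v * r;
}

int main() {
    std::vector<float> x(1 << 20, 3.0f);
    scale_div(x, 7.0f);
    scale_recip(x, 7.0f);
    return 0;
}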

SATG 47
Data Level Parallelism

1 Inter-Node Parallelism

2 Intra-Node Parallelism

3 Core / Thread Level Parallelism

4 Thread-Level Parallelism with SMT

5 Instruction-level Parallelism

6 Data Level Parallelism

SATG 48
Data Level Parallelism

▪ x86 ISA extensions
▪ Flynn's taxonomy: Single Instruction Multiple Data (SIMD)
▪ Code candidates: loops – split the iterations into chunks of SIMD width, reducing the trip count
▪ Vectorization can happen implicitly, with the compiler generating the instructions

[Diagram: AVX-512 fused multiply-add (FMA) vector instruction, D = (A + C) x B, operating on eight FP64 lanes at once]

SATG 49
Data Level Parallelism

Evolution of the x86 SIMD instruction sets:
▪ MMX (1996): 64-bit vectors, 57 instructions, integer types only, 8 registers
▪ SSE (1999): 128-bit vectors, 70 instructions, single-precision vectors, streaming operations
▪ SSE2 (2001): 144 instructions, double-precision vectors, 8/16/32/64-bit integer vectors
▪ SSE3 (2004): 13 instructions, complex data
▪ SSSE3 (2006): 32 instructions, decode
▪ SSE4.1 (2007): 47 instructions, video and graphics building blocks, advanced vector instructions
▪ SSE4.2 (2008): 8 instructions, string/XML processing, POP-count, CRC
▪ AES-NI (2009): 7 instructions, encryption, decryption, key generation
▪ AVX (2011): 256-bit vectors, ~100 new instructions, 3- and 4-operand instructions
▪ AVX2 (2012/2013): integer AVX expands to 256 bit, improved bit manipulation, FMA, vector shifts, gather
▪ AVX-512: 512-bit vectors, ~300 legacy SSE instructions updated to 512 bit, 16 new 512-bit registers, 8 opmask registers

One vector add processes several elements per instruction, e.g. a[3..0] + b[3..0] = c[3..0] for the loop:

for (i = 0; i < MAX; i++)
    c[i] = a[i] + b[i];

SATG 50

Data Level Parallelism

From ease of use to full programmer control:

▪ Performance libraries (e.g. IPP and MKL) – ease of use
▪ Compiler: fully automatic vectorization
▪ Compiler: auto-vectorization hints (#pragma ivdep, …)
▪ Explicit vectorization with OpenMP* 4.0 and later (SIMD directive)
▪ SIMD intrinsic class (F32vec4 add)
▪ Vector intrinsics (_mm_add_ps())
▪ Assembler code (addps) – programmer control
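A minimal sketch contrasting three rungs of this ladder on the same loop: an auto-vectorization hint (#pragma ivdep is Intel-compiler specific), the OpenMP SIMD directive, and SSE intrinsics; the function names are illustrative:

#include <xmmintrin.h>   // _mm_add_ps() and friends (SSE)
#include <vector>

// 1) Auto-vectorization hint: assert that the pointers do not alias.
void add_hint(float* __restrict c, const float* a, const float* b, int n) {
#pragma ivdep
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// 2) Explicit vectorization: OpenMP 4.0 SIMD directive.
void add_simd(float* c, const float* a, const float* b, int n) {
#pragma omp simd
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// 3) Vector intrinsics: full programmer control, four floats per SSE register.
void add_intrin(float* c, const float* a, const float* b, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(c + i, _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    for (; i < n; ++i) c[i] = a[i] + b[i];                 // scalar remainder
}

int main() {
    const int n = 1000;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
    add_hint(c.data(), a.data(), b.data(), n);
    add_simd(c.data(), a.data(), b.data(), n);
    add_intrin(c.data(), a.data(), b.data(), n);
    return 0;
}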

SATG 51
Data Level Parallelism

What can I do?
▪ Use the most advanced vector ISA available on your machine – the compiler default is SSE2
▪ Add code pragmas to the non-vectorized loops in order to feed the compiler with domain knowledge
▪ Re-arrange code to achieve
  • loops with a trip count known at runtime
  • branch-free loops to increase efficiency
▪ Tell the compiler to use the full 512-bit vector width in case of compute-heavy loops

Which Tools?
▪ Intel® DPC++/C++ Compiler, Intel® C++ Compiler Classic, Intel® Fortran Compiler Classic, & Intel® Fortran Compiler (Beta)
▪ Use the Intel® Advisor to
  • Identify vector code candidates
  • Determine vector efficiencies
  • Determine code efficiencies with the Roofline Analysis
  • Get a step-by-step guide towards efficient vector code

Programming Guidelines for Vectorization


SATG 52
Data Level Parallelism

[Screenshots: Intel Advisor vectorized-loop summary (top left) and Advisor Roofline chart (bottom right)]
SATG 53
Data Level Parallelism – legacy instructions
The x87 Coprocessor

What is / was that?
▪ FP extension of the 8086 ISA – later (80486) integrated on the die
▪ Today, x87 instructions are rarely used; if they are, numerical results might differ (80-bit intermediates)!
▪ Avoid these instructions in favor of reproducible results for both scalar & vector code

SATG 54
Accelerator / Offloading Parallelism

1 Inter-Node Parallelism

2 Intra-Node Parallelism

3 Core / Thread Level Parallelism

4 Thread-Level Parallelism with SMT

5 Instruction-level Parallelism

6 Data Level Parallelism

7 Accelerator / Offloading Parallelism

SATG 58
Accelerator / Offloading Parallelism

▪ GPGPU – denser compute
▪ Slice -> SubSlice -> Execution Units (EUs)
▪ Each EU has
  • 8 threads (similar to SMT)
  • a 28KB register file
  • 2x 4-wide SIMD units
▪ Offload-based programming enabled by oneAPI
SATG 59
Accelerator / Offloading Parallelism
Data Parallel C++ (DPC++) - Hello World

▪ Intel® oneAPI Accelerator Programming
  • C/C++: Intel® DPC++ Compiler, or Intel® C++ Compiler with OpenMP Offloading
  • Fortran: Intel® Fortran Compiler with OpenMP Offloading

#include <CL/sycl.hpp>
#include <iostream>
#include <vector>
using namespace sycl;

int main() {
  std::vector<float> A(1024), B(1024), C(1024);
  // some data initialization
  {
    buffer bufA{A}, bufB{B}, bufC{C};
    queue q;
    q.submit([&](handler &h) {
      auto A = bufA.get_access(h, read_only);
      auto B = bufB.get_access(h, read_only);
      auto C = bufC.get_access(h, write_only);
      h.parallel_for(1024, [=](auto i) {
        C[i] = A[i] + B[i];
      });
    });
  }
  for (int i = 0; i < 1024; i++)
    std::cout << "C[" << i << "] = " << C[i] << std::endl;
}

SATG 60
Accelerator / Offloading Parallelism

What can I do?
▪ Enable your code today for Intel integrated graphics GPUs and upcoming datacenter GPGPUs
▪ Reduce data transfers between host and device (see the sketch below)
▪ Minimize communication / synchronization between individual threads

Which Tools? – Intel® oneAPI
▪ Intel® DPC++/C++ Compiler & Intel® Fortran Compiler (Beta)
▪ Intel® oneMKL
▪ Use the Intel® Advisor to
  • Find offloading candidates with the Offload Advisor
  • Estimate code speedup on the GPU with the Offload Advisor
▪ Use Intel® VTune™ Profiler to
  • Analyze the offload performance
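The DPC++ example on the previous slide shows the SYCL route; the sketch below shows the OpenMP offload route for the same vector add. An offload-capable device and the exact compile flags (e.g. -fiopenmp -fopenmp-targets=spir64 for the Intel compilers) are assumptions of this sketch:

#include <cstdio>
#include <vector>

int main() {
    const int n = 1024;
    std::vector<float> A(n, 1.0f), B(n, 2.0f), C(n, 0.0f);
    float *a = A.data(), *b = B.data(), *c = C.data();

    // Map the inputs to the device once, run the loop there, map the result back.
    // Keeping data resident on the device across kernels reduces transfers further.
    #pragma omp target teams distribute parallel for map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    std::printf("C[0] = %f, C[%d] = %f\n", C[0], n - 1, C[n - 1]);
    return 0;
}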

SATG 61
Accelerator / Offloading Parallelism

[Screenshots: Offload Advisor efficiency report and VTune GPU Offload Analysis]

SATG 62
The Seven Levels of Parallelism

1 Inter-Node Parallelism

2 Intra-Node Parallelism

3 Core / Thread Level Parallelism

4 Thread-Level Parallelism with SMT

5 Instruction-level Parallelism

6 Data Level Parallelism

7 Accelerator / Offloading Parallelism

SATG 69
The Seven Levels of Parallelism
Parallelism on the Intel Architecture Supported by the Intel® oneAPI Base & HPC Toolkits

Levels: Inter-Node Parallelism, Intra-Node Parallelism, Core / Thread Level Parallelism, Thread-Level Parallelism with SMT, Instruction-level Parallelism, Data Level Parallelism, Accelerator / Offloading Parallelism

Intel oneAPI Tools for HPC (Intel® oneAPI HPC Toolkit + Intel® oneAPI Base Toolkit)

Direct Programming:
▪ Intel® C++ Compiler Classic
▪ Intel® Fortran Compiler Classic
▪ Intel® Fortran Compiler (Beta)
▪ Intel® oneAPI DPC++/C++ Compiler
▪ Intel® DPC++ Compatibility Tool
▪ Intel® Distribution for Python*
▪ Intel® FPGA Add-on for oneAPI Base Toolkit

API-Based Programming:
▪ Intel® MPI Library
▪ Intel® oneAPI DPC++ Library
▪ Intel® oneAPI Math Kernel Library
▪ Intel® oneAPI Data Analytics Library
▪ Intel® oneAPI Threading Building Blocks
▪ Intel® oneAPI Video Processing Library
▪ Intel® oneAPI Collective Communications Library
▪ Intel® oneAPI Deep Neural Network Library
▪ Intel® Integrated Performance Primitives

Analysis Tools:
▪ Intel® Inspector
▪ Intel® Trace Analyzer & Collector
▪ Intel® Cluster Checker
▪ Intel® VTune™ Profiler
▪ Intel® Advisor
▪ Intel® Distribution for GDB*

SATG 70
Performance Analysis Tools for Diagnosis
Intel® Parallel Studio

[Decision flow, starting from the Application Performance Snapshot:
▪ Cluster scalable? No: tune MPI with the Intel® Trace Analyzer & Collector and the Intel® MPI Tuner.
▪ Yes: use Intel® VTune™ Amplifier to check whether the code is memory-bandwidth sensitive.
  • Yes: optimize bandwidth with the Intel® VTune™ Profiler.
  • No: check whether the threading is effective. If not, improve threading with the Intel® VTune™ Profiler; if yes, vectorize with the Intel® Advisor.]
SATG 71

Summary
Code modernization is not always 'easy & quick' (a.k.a. "rewrite").
The common methodology is to (1) analyze, then (2) optimize.

Two themes consistently re-occur: task and data parallelism.

To achieve the best CPU performance, make sure your programs are efficiently vectorized and well parallelized.
The latter in particular also applies to GPUs.

SATG 72
(Recorded) Tools Webinars
▪ Intel® oneAPI technical webinar (2 days) – co-hosted with Zuse Institute Berlin
  ▪ oneAPI webinar on March 2nd and 3rd, 2021
  ▪ https://www.zib.de/workshops/2021/oneapi
  ▪ https://www.hlrn.de/doc/display/PUB/Joint+NHR@ZIB+-+INTEL+++oneAPI+Workshop
▪ Intel® oneAPI Rendering Toolkit webinar (1 day):
  ▪ Rendering, 08.06.2021: https://www.ai-spektrum.de/veranstaltungen/intel-oneapi-rendering-toolkit-workshop.html
  ▪ Make use of simulated data and visualize it in a high-fidelity, photorealistic fashion
▪ Intel® oneAPI AI Analytics Toolkit webinar (1 day):
  ▪ AI webinar on 15.06.2021: https://www.ai-spektrum.de/veranstaltungen/artificial-intelligence-on-intelr-platforms-using-intelr-oneapi-ai-analytics-toolkit-openvino-workshop.html
  ▪ Intel performance-optimized AI toolkits for classic and machine learning

SATG Copyright ©, Intel Corporation. All rights reserved. 73


*Other names and brands may be claimed as the property of others.
Selected Links: Intel Tools User's Manuals
• Intel® C++ Compiler Classic Developer Guide and Reference
https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top.html

• Intel Data Parallel C++


https://link.springer.com/content/pdf/10.1007%2F978-1-4842-5574-2.pdf

• Intel VTune Cookbook


This cookbook introduces methodologies and use-case recipes to analyze the performance of your code with VTune Profiler, a tool that helps you identify ineffective algorithm and hardware usage and provides tuning advice.
https://www.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top.html

• Intel® Advisor Cookbook


The Intel® Advisor Cookbook contains step-by-step instructions to help effectively use more cores, vectorization, or heterogeneous
processing using Intel Advisor.
https://www.intel.com/content/www/us/en/develop/documentation/advisor-cookbook/top.html

• Intel Inspector User Guide(s) for Linux and for Windows


Get Started with Intel® Inspector (Windows* OS). This document explains how to get started with the Intel® Inspector workflows.
Linux:
https://www.intel.com/content/www/us/en/develop/documentation/inspector-user-guide-linux/top.html

▪ Windows :
https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-inspector/top/windows.html

• Microsoft Visual Studio* Integration of Intel VTune


https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/launch/microsoft-visual-studio-integration.html

SATG Copyright ©, Intel Corporation. All rights reserved. 74


*Other names and brands may be claimed as the property of others.
Other useful Links
• Intel Software Developer Manuals
These manuals describe the architecture and programming environment of the Intel® 64 and IA-32 architectures.

Intel® 64 and IA-32 Architectures Software Developer Manuals

• Intel Software Development Tools Magazine ‘Intel Parallel Universe’

https://www.intel.com/content/www/us/en/developer/community/parallel-universe-magazine/overview.html?s=Newest

• Intel Developer Zone

https://www.intel.com/content/www/us/en/developer/overview.html

SATG Copyright ©, Intel Corporation. All rights reserved. 75


*Other names and brands may be claimed as the property of others.
QUESTIONS?

SATG 76