AcceleratingAIAdvancements Pre Print Doube Blind

ACCELERATING AI A DVANCEMENTS
A P REPRINT
June 5, 2023
A BSTRACT
While new architectures of GPUs increase the maximum number of FLOPs they can deliver, the vast
majority of processes are incapable of fully utilizing the GPU to it’s full potential. We review the
concept of fractionalizing GPUs, leading to the increase in performance. We develop the metric for
determining whether the multiple tasks are adequately matched. We validate the proposed metric
against Arc Compute’s software and demonstrate that properly matched tasks see an increase in
performance up to 200%.
Keywords BLAS · Utilization · Performance · GPGPU · CUDA Partitioning · Program Co-Run
1 Introduction
We exclusively focus on NVIDIA GPUs; however, the same line of reasoning can be expanded to cover other accelerator
chips. In GPUs, with Compute Context of above 7.0, we have around 174 distinct metrics that are available to
programmers NVIDIA [2023a]. Even more metrics are available inside the recently open sourced NVIDIA kernel
module. Programmers of NVIDIA GPUs can only partially achieve the full potential of the GPU, no matter how
optimized their code is. In this paper, we discuss Memory bandwidth vs. Compute bandwidth; similarly other resources
available in the GPU can be analyzed.
Various approaches to CUDA kernel matching have been examined through papers in the past Jain et al. [2019], Zhong
and He [2013], López-Albelda et al. [2021]. In summary, Kernelet’s approach through Markov chaining is possible to
accomplish PTX task matching, however, issues of uneconomical CPU costs are introduced. The approaches proposed
by Jain et al. [2019] and López-Albelda et al. [2021] are outdated and do not handle newer versions of the NVIDIA
driver. Our proposal GVM accomplishes PTX task matching without outlined drawbacks.
At any given time, a CUDA Kernel is either moving data to a higher/lower level memory architecture or performing
some execution on data already located in the register files of the GPU. To help demonstrate this, we will first explain
the compilation structure of CUDA code and follow up by providing memory access times for different sections of the
GPU. We will then identify internal/external GPU scheduling schemes which are the limiting factors towards scheduling
multiple CUDA Kernels at once. Our analysis will be continued by looking into different GPU BLAS optimizations as
they are the basis of all neural networks. We will use this analysis to build an affinity score for matching multiple GPU
kernels, which will be validated by running different-sized SGEMM operations on the same NVIDIA A100 GPU.
1.1 CUDA Compilation Pipeline
CUDA programs are traditionally split into 2 sections, a CPU based section and a device specific section. The device
specific section is first compiled into “Parallel Thread eXecution” (PTX) code, and later translated into architecture
specific SASS code. Refer to Figure 1 to get a visual explanation of the process.
Accelerating AI Advancements A P REPRINT
Parse Device Source Code
PTX Binary Representation
Compile To Architecture Specific
SASS Binary Representation
To GPU
Figure 1: GPU compilation pipeline.
The SASS code is uploaded and executed on the GPU in question; however, for effective code, some dead cycles will
always be present in the compiled SASS. These dead cycles occur due to time wasted while waiting for data from global
memory on the GPU.
1.2 GPU Memory Pipeline
In line with traditional CPUs/Microcontrollers, a GPU cannot directly access memory from the global memory. While
some commands in the Ampere series and above allow for the bypassing of L1 cache memory, they require the data to
come down to an L2 cache before hitting the global memory on the GPU. Previous works have provided information
regarding GPU latency from both a bandwidth and clock cycle perspective on computation speed Abdelkhalik et al.
[2022], van Stigt et al. [2022]. To help add perspective, we have provided the table for cycles per instruction for A100s
in Table 1.
Memory Type CPI (cycles)

Global memory 290
L2 cache 200
L1 cache 33
Shared Memory (ld/st) (23/19)
Table 1: GPU Memory access latencies from Abdelkhalik et al. [2022] in cycles per instructions (CPI).
Combining this knowledge, with the fact, a multiply add (MAD) instruction can take 2 to 11 GPU cycles (Abdelkhalik
et al. [2022]) to complete, we can safely schedule between 18 (11 CPI) and 100 (2 CPI) MAD instructions in the span
of a single L2 cache memory access. Now that we have the basis for identifying how long a PTX/SASS instruction will
take on the Ampere architecture, we can begin the process of identifying instructions which can be coscheduled. This
process requires us to dissect a CUDA program and identify how exactly a kernel gets scheduled.
1.3 GPU Scheduling
Considerable research has been conducted on the real-time scheduling properties of GPUs and CUDA Kernels Yang
[2020], López-Albelda et al. [2021]. To begin this discussion let us examine Figure 1.3.
2
Figure 2: Structure of the Ampere SMNVIDIA [2023b]
From Figure 1.3, we start by noticing that the L1 cache between different warps is shared together. There is also a
dispatch unit with each warp scheduler. A dispatch unit is responsible for scheduling the instructions selected by the
warp. There are two other units not mentioned in the diagram called the “Load Store Unit” (LSU) and the “Memory
Input Output” (MIO). These handle copying/moving data up/down and the execution of data in the register files. This
information lays the groundwork for an in-depth explanation of how the components interact together.
A CUDA program is split into blocks and threads and is scheduled on a stream. Streams have different priorities,
which may also affect when the blocks get scheduled Yang [2020], Amert [2017]. Multiple streams may execute
simultaneously, provided favourable conditions exist. The NULL stream is treated as a low priority stream when
different priority streams are created. Luckily, it seems that CUDA streams are a software concept and is divergent from
the physical hardware. Each block has a schedulable window of time and a set number of threads that are required to
execute on it.
The NULL stream causes an implicit synchronization of all CUDA contexts running at the moment. This forces all the
kernels/memory copies which would otherwise be asynchronous to become synchronous.
In bare metal situations, there is a bottleneck in the PCI bus which causes concurrent kernels to have a false data
dependance on each other. This is mitigated through the use of SR-IOV.
Our proposal for the metric to ensure the schedulability of common kernels relies on the time to execute a single block:
3
Nthreads ∗ Tblock
Tsched = (1)
Nwarps ∗ Nwarpthreads
• Tsched : Time to complete 1 scheduled block.
• Nthreads : Number of threads per block.
• Tblock : Average time to complete 1 block in term of threads.
• Nwarps : Number of warps per SM.
• Nwarpthreads : Number of threads that can be allocated per WARP.
Each block is allocated ucode (PTX/SASS) code that is executed on the VM. To our knowledge, this is a faithful
representation of the device section of the CUDA code.
1.4 BLAS Experiments
Now that we have described the basis of computing on the GPU, the problem we are looking to solve is having a
small SGEMM matched with a large SGEMM operation. The hypothesis is that the large SGEMM is memory bound,
while the small SGEMM is compute bound. To minimize errors in our own code, and to avoid redoing the work done
by Boehm [2023], we have opted to use cuBLAS. Using cuBLAS also allows us to show this particular mechanism can
perform on highly optimized codebases.
To prevent the synchronization issue in Section 1.3, we use virtualization and present half of the GPU to each side
uniquely. There are only three major approaches to performing this task: MIG, vGPU, GVM. While other approaches
like GViM Gupta et al. [2009], FlexSchedLópez-Albelda et al. [2021], Fractional GPUsJain et al. [2019], KerneletZhong
and He [2013], Backend.AIBackend.AI [2023], and Run.AI’s “Fractional GPU”Run.AI [2023] do exist, they have been
disregarded due to either being outdated or insufficient at solving implicit synchronization issues between multiple
GPU users. Some are better than others and do not require source code access; however, that is a paper for a different
time. We are also not looking at MPS, as the majority of the previous work was compared to MPS quite a bit.
Presenting only half of the GPU while utilizing MIG will often lead to a substantial slowdown. As a result, we define
the following measure to evaluate a performance increase regarding MIG:
Tshared
P = (2)
N ∗ Tf ull
• P : Relative performance increase.
• N : Number of concurrent tasks scheduled.
• Tshared : Shared time to complete 1 full CUDA call for all blocks.
• Tf ull : Full passthrough time to complete full CUDA call.
This formula takes into account that on average we would only see N1 amount of total SMs. Recalling Section 1.3, we
know that all CUDA calls are scheduled in groups of blocks. By lowering the number of SMs available, we effectively
increase the time for completing a CUDA kernel. In the measurement of time a 1-to-1 performance of blocks correlates
in N times when referenced against full passthrough.
The objective is to simulate how real-world applications would be affected through sharing based on time scheduling
and virtualization. To gather such information we developed the following set of 10 tests:
4
Case # Small SGEMM Large SGEMM

0 (1024, 1024) * (1024, 864) (1024, 1024) * (1024, 864)
1 (512, 1024) * (1024, 864) (2048, 1024) * (1024, 864)
2 (256, 1024) * (1024, 864) (4096, 1024) * (1024, 864)
3 (128, 1024) * (1024, 864) (8192, 1024) * (1024, 864)
4 (64, 1024) * (1024, 864) (16384, 1024) * (1024, 864)
5 (64, 1024) * (1024, 864) (32768, 1024) * (1024, 864)
6 (64, 1024) * (1024, 864) (65536, 1024) * (1024, 864)
7 (64, 1024) * (1024, 864) (131072, 1024) * (1024, 864)
8 (64, 1024) * (1024, 864) (262144, 1024) * (1024, 864)
9 (64, 1024) * (1024, 864) (524288, 1024) * (1024, 864)
To help our analysis, we created the following list of kernels, internally called by the Small and Large SGEMMs.
Case # Small Kernel Large Kernel

0 ampere_sgemm_128x32_nn ampere_sgemm_128x32_nn
1 ampere_sgemm_128x32_nn ampere_sgemm_128x32_nn
2 ampere_sgemm_64x32_sliced1x4_nn ampere_sgemm_128x32_nn
3 ampere_sgemm_64x32_sliced1x4_nn ampere_sgemm_128x32_nn
4 ampere_sgemm_64x32_sliced1x4_nn + splitKreduce_kernel ampere_sgemm_128x32_nn
We will propose the metric for determining kernel affinity for coscheduling in Section 2.
2 Metric
With the necessary background knowledge we can create and compare our metric to other metrics that are available for
GPU task matching. The first approach was done by the Kernelet project Zhong and He [2013]. The Kernelet approach
defines “Pipeline Utilization Ratio” (PUR), as well as “Memory-bandwidth Utilization Ratio” (MUR). These two ratios
are listed below:
Instruction_Executed
PUR =
T ime ∗ F requency ∗ P eak_IP C
Dram_Reads + Dram_W rites
MUR =
T ime ∗ F requency ∗ P eak_M P C
The P eak_IP C refers to the maximum instructions per cycle possible during a single instruction. Whereas
P eak_M P C refers to the maximum memory bandwidth executed per cycle. These ratios provide metrics by which we
can analyze a single instruction and determine an initial weight for execution of the instruction. However, they do not
take into account L2/L1 reads/writes or even the memory hierarchy inside the GPU. So we propose slightly different
metrics for the base ratio.
Σwarpsize
i=1 IWi ∗ αIWi ,UIW
i
PUR = (3)
ΣU
i=1 βUi
size
Σwarpsize
i=1 IWi ∗ γIWi ,MIW
i
MUR = (4)
ΣM
i=1 σMi
size
• I: The set of all instructions which are being looked at by the warp scheduler at a given moment in time.
5
• Wi : The index at warp thread i that is currently being accessed.

• α: Heuristics for sub core utilizations, this is indexed by both an instruction type as well as the primary
utilization pipeline.
• U : The set of all utilization pipelines inside the sub core of the SM.
• β: Set of heuristics associated with the maximum utilization of each sub core pipeline.
• γ: Heuristics for memory utilizations — this is indexed by both an instruction type as well as memory
utilization pipeline.
• M : The set of all memory utilization pipelines.
• σ: Set of heuristics relating the maximum utilization of the memory pipeline in question.
As opposed to looking at each instruction individually, we look at all the instructions that the warp can schedule at
a given moment in time. Blocks from multiple unique streams can be scheduled at the same time, Harris [2023],
López-Albelda et al. [2021]. To simplify the discussion we focus solely on full blocks that can be concurrently
scheduled. We will touch on how it is possible to use this approach to also select slices to perform maximum utilization.
Each block has a full copy of the PTX/SASS code that was uploaded to the GPU. A block is completed when the
entire CUDA kernel is completed for all threads. As a result, we can modify α and σ to be dynamic heuristics that are
specified based on the previous N instructions that were submitted by the kernel in question. This penalizes kernels
that attempt to max out the bandwidth of a particular pipeline.
This allows us to start matching high P U R kernels with high M U R kernels. Using this methodology, we will maximize
throughput on both the MIO and LSU in the SM’s sub-core.
A drawback is that this does not properly handle branching inside the kernels. If we analyze any of the kernels in
Table 1.4, we will notice there are a few branch commands inside the resulting PTX/SASS code. These branches can
modify a high P U R kernel, into a high M U R kernel depending on the data that is being run. For this reason, we need
to analyze the kernel before the first run to identify the constant table inside the PTX/SASS.
By performing the static analysis ahead of time, we can model if the executing kernel will have a high P U R or a high
MUR
A series of stochastic models is currently being evaluated to identify the best model to use to provide the best utilization
metrics. This will be the subject of future papers. These models are also the tools that are used to select either static, or
dynamic heuristics in equations 3 and 4. The argument for the use of dynamic heuristics can be seen from Abdelkhalik
et al. [2022], where repeated instructions incur a lower CPI on average.
The models used in this paper are a simple exponentially weighted moving average for the α, and a Hidden Markov
Model for the σ. The reasoning is for core accesses we, simply need to avoid saturation; however, for memory accesses,
we need to also model the L1/L2 cache misses. Similar work has been performed by Wang et al. [2018]. We also benefit
from having the same core execute the same blocks by having coalesced memory accesses.
The described metric is used inside the proprietary software developed by Arc Compute, to identify the optimal GPU to
coschedule processes on.
3 Pseudo Code
We outline now the testing methodology. To remove physical medium from the equation, we made a circular buffer of
random data inside the GPU vRAM allocated to the VM. The allocated vRAM is then accessed. This is used to prevent
any excess memory being left in the L2/L1 caches of the GPU. Both matrices are located in the global memory of the
GPU. Iterating through the mapping of each SGEMM, we can provide real batch training/inference times for large and
small-scale neural network layers.
1. Load circular buffer.

2. Iterate for N samples.
3. Create timer.
4. Perform SGEMM.
5. Stop timer.
6. Average timers over N samples.
6
While one side is measured, the other is performing the test for N = ∞ samples. To prevent issues in the startup cost,
we start the infinite version of the code and allow it to reach a steady state before running the measured version of the
code. Overall we noticed that the infinite side maintains the same average time for completion as the measured version.
4 Results
The table below provides the results of our experiments utilizing our software, GVM.
Case # Large SGEMM Passthrough Small SGEMM Passthrough Large SGEMM Shared Small SGEMM Shared
Average: 0.153 ms/iter Average: 0.142 ms/iter Average: 0.114 ms/iter Average: 0.114 ms/iter
0 Minimum: 0.146 ms/iter Minimum: 0.120 ms/iter Minimum: 0.113 ms/iter Minimum: 0.112 ms/iter
Maximum: 0.170 ms/iter Maximum: 0.160 ms/iter Maximum: 0.127 ms/iter Maximum: 0.124 ms/iter
We have noticed that the code had a brief period of instability in the first 3 iterations, after which the variance became
inconsequential. In Table 4 we see the associated min-max values for the experiment. We also notice that in case
numbers 0-2 we have a lower peak-to-peak even in the first 3 iterations. We determined this is due to the lower startup
cost for running common instructions on the same pipelines as was noticed in Abdelkhalik et al. [2022]. From cases of
3 and above for the large SGEMM program, we noticed that there was quite a bit of L2 cache contestation. This cleared
up later in the program’s runtime.
For cases above 3, the large SGEMM model is already over 300% of the A100 L2 cache size. This explains the
slowdown in the percentage increase with respect to MIG. The small SGEMM codebase in cases 0 and 1 have a roughly
even M U R and P U R coefficient. With the major difference between the large SGEMM and small SGEMM being
changed is the number of blocks the kernel is called with. Within our model the large SGEMM kernel for case 1 has a
higher M U R than case 1 for the small SGEMM kernel. This allows them to have a speed increase, as well as explains
why the speed per iteration between case 0 and case 1 nearly halved when we halved the data size.
Regarding the percentage increase with respect to MIG we obtain the following:
7
Case # Large SGEMM Small SGEMM

0 169.244% 150.352%
1 151.321% 171.145%
2 132.026% 137.917%
3 111.625% 182.985%
4 63.060% 207.080%
5 45.524% 207.556%
6 42.658% 207.489%
7 41.861% 206.608%
8 40.505% 206.608%
9 39.673% 206.608%
We were able to validate both MIG and vGPU from NVIDIA. In the case of MIG, performance has decreased
to approximately half of the full GPU’s passthrough. In static instances of CUDA kernels, vGPU performs with
similar results to GVM; however, when we introduce dynamic kernel changes the vGPU system is unable to improve
performance. We see a dynamic kernel change from cases 4 and above; the small SGEMM operation runs two separate
kernels in the span of a single kernel by the large SGEMM operation.
5 Utilizations
We also measured in a safe manner the utilization results for Case #0 to Case #5. These are included below:
Case # Passthrough Large SGEMM Passthrough Small SGEMM Shared Large SGEMM Shared Small SGEMM
SM Utils SM Utils SM Utils SM Utils
82% 82% 54% 44%
0
HBM2 Bandwidth HBM2 Bandwidth HBM2 Bandwidth HBM2 Bandwidth
8% 8% 2% 1%
89% 73% 58% 40%
1
8% 7% 2% 2%
93% 70% 71% 27%
2
13% 7% 6% 2%
96% 60% 82% 16%
3
46% 6% 29% 6%
97% 50% 90% 8%
4
85% 6% 60% 5%
98% 48% 76% 22%
5
15% 6% 12% 3%
It is interesting to note that while SM core utilization goes up when one shares two jobs, the HBM2 Bandwidth decreases.
In majority of cases the summation of both HBM2 Bandwidths is lower than the single passthrough approach.
All cases above case 5, have a form of resonance which is a bit hard to model, future papers will explain the reason for
the resonance.
8
6 Conclusion
From our experimentation we have noticed that we have been able to consistently speed up the small SGEMM
operation, and in a lot of situations speed up the large SGEMM operation as well. We also noticed that even though
we speed up computation, in a lot of cases we actually decreased HBM2 bandwidth, as well as the individual SM
utilization/occupancy has actually decreased as well. This leads us to believe that there are coalesced memory access
optimizations that can be further taken advantage of.
References
NVIDIA. Cupti metrics api, May 2023a. URL https://docs.nvidia.com/cupti/r_main.html#r_metric_api.
Saksham Jain, Iljoo Baek, Shige Wang, and Ragunathan Rajkumar. Fractional gpus: Software-based compute and
memory bandwidth reservation for gpus. In 2019 IEEE Real-Time and Embedded Technology and Applications
Symposium (RTAS), pages 29–41, 2019. doi:10.1109/RTAS.2019.00011.
Jianlong Zhong and Bingsheng He. Kernelet: High-throughput GPU kernel executions with dynamic slicing and
scheduling. CoRR, abs/1303.5164, 2013. URL http://arxiv.org/abs/1303.5164.
Bernabé López-Albelda, Francisco Manuel Castro, José María González-Linares, and Nicolás Guil Mata. Flexsched:
Efficient scheduling techniques for concurrent kernel execution on gpus. The Journal of Supercomputing, 78:43–71,
2021.
Hamdy Abdelkhalik, Yehia Arafa, Nandakishore Santhi, and Abdel-Hameed Badawy. Demystifying the nvidia ampere
architecture through microbenchmarking and instruction-level analysis, 2022.
Rico van Stigt, Stephen Nicholas Swatman, and Ana-Lucia Varbanescu. Isolating gpu architectural features using
parallelism-aware microbenchmarks. In Proceedings of the 2022 ACM/SPEC on International Conference on
Performance Engineering, ICPE ’22, page 77–88, New York, NY, USA, 2022. Association for Computing Machinery.
ISBN 9781450391436. doi:10.1145/3489525.3511673. URL https://doi.org/10.1145/3489525.3511673.
Ming Yang. SHARING GPUS FOR REAL-TIME AUTONOMOUS-DRIVING SYSTEMS. Phd thesis, University of
North Carolina, Chapel Hill, 2020. Available at https://www.cs.unc.edu/~anderson/diss/mingdiss.pdf.
NVIDIA. Ampere whitepaper, May 2023b. URL https://images.nvidia.com/aem-dam/en-zz/Solutions/
data-center/nvidia-ampere-architecture-whitepaper.pdf.
Tanya Amert. Gpu scheduling on the nvidia tx2: Hidden details revealed. 2017. URL https://www.cs.unc.edu/
~anderson/papers/rtss17c.pdf.
Simon Boehm. How to optimize a cuda matmul kernel for cublas-like performance: a worklog, May 2023. URL
https://siboehm.com/articles/22/CUDA-MMM.
Vishakha Gupta, Ada Gavrilovska, Karsten Schwan, Harshvardhan Kharche, Niraj Tolia, Vanish Talwar, and
Parthasarathy Ranganathan. Gvim: Gpu-accelerated virtual machines. In Proceedings of the 3rd ACM Work-
shop on System-Level Virtualization for High Performance Computing, HPCVirt ’09, page 17–24, New York, NY,
USA, 2009. Association for Computing Machinery. ISBN 9781605584652. doi:10.1145/1519138.1519141. URL
https://doi.org/10.1145/1519138.1519141.
Backend.AI. Gpu virtualization and fractional gpu allocation, May 2023. URL https://console.docs.backend.
ai/en/docs-essential-guide-r2/allocate_gpu/allocate_gpu.html.
Run.AI. Quickstart: Launch workloads with gpu fractions, May 2023. URL https://docs.run.ai/Researcher/
Walkthroughs/walkthrough-fractions/.
Mark Harris. How to query device properties and handle errors in cuda c/c++, June 2023. URL https://developer.
nvidia.com/blog/how-query-device-properties-and-handle-errors-cuda-cc/.
Haonan Wang, Fan Luo, Mohamed Ibrahim, Onur Kayiran, and Adwait Jog. Efficient and fair multi-programming in
gpus via effective bandwidth management. In 2018 IEEE International Symposium on High Performance Computer
Architecture (HPCA), pages 247–258, 2018. doi:10.1109/HPCA.2018.00030.

AcceleratingAIAdvancements Pre Print Doube Blind

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

AcceleratingAIAdvancements Pre Print Doube Blind

Uploaded by

Copyright:

Available Formats

ACCELERATING AI A DVANCEMENTS

Keywords BLAS · Utilization · Performance · GPGPU · CUDA Partitioning · Program Co-Run

1.1 CUDA Compilation Pipeline

Parse Device Source Code

PTX Binary Representation

Compile To Architecture Specific

SASS Binary Representation

Figure 1: GPU compilation pipeline.

1.2 GPU Memory Pipeline

Memory Type CPI (cycles)

1.3 GPU Scheduling

Figure 2: Structure of the Ampere SMNVIDIA [2023b]

• Tsched : Time to complete 1 scheduled block.

• Nthreads : Number of threads per block.

• Tblock : Average time to complete 1 block in term of threads.

• Nwarps : Number of warps per SM.

• Nwarpthreads : Number of threads that can be allocated per WARP.

1.4 BLAS Experiments

• P : Relative performance increase.

• N : Number of concurrent tasks scheduled.

• Tf ull : Full passthrough time to complete full CUDA call.

Case # Small SGEMM Large SGEMM

Case # Small Kernel Large Kernel

• Wi : The index at warp thread i that is currently being accessed.

1. Load circular buffer.

Case # Large SGEMM Small SGEMM

You might also like