
Comparing Tesla K40 and Tesla V100

Anon K
December 2020

Abstract
NVIDIA Tesla was the name of NVIDIA's line of products targeted at stream processing and general-purpose computing on graphics processing units (GPGPU), named after the pioneering electrical engineer Nikola Tesla.[1] Its products began with GPUs from the G80 series and have continued to accompany the release of new chips. They are programmable using the CUDA or OpenCL APIs. Two GPUs from the Tesla series, namely the Tesla K40 and the Tesla V100, are compared here.

                              Tesla K40                  Tesla V100

Chip                          GK180 (Kepler)             GV100 (Volta)
Architecture release year     2012                       2017
SMs                           15                         80
TPCs                          15                         40
FP32 cores per SM             192                        64
FP32 cores per GPU            2880                       5120
FP64 cores per SM             64                         32
FP64 cores per GPU            960                        2560
Tensor cores per SM           N/A                        8
Tensor cores per GPU          N/A                        640
Peak clock speed              875 MHz                    1530 MHz
Peak FP32 TFLOPS              5                          15.7
Peak FP64 TFLOPS              1.7                        7.8
Peak tensor TFLOPS            N/A                        125
Memory bus width              384-bit                    4096-bit
Memory bandwidth              288 GB/s                   900 GB/s
Global memory                 12 GB                      16 GB
L2 cache                      1536 KB                    6144 KB
Shared memory per SM          16 KB / 32 KB / 48 KB      Configurable up to 96 KB
Register file per SM          256 KB                     256 KB
Register file per GPU         3840 KB                    20480 KB
Transistors                   7.1 billion                21.1 billion

Table 1: K40 vs V100

1 K40
The chip in the Tesla K40 is the GK180, which uses NVIDIA's Kepler micro-architecture. Kepler is the codename for a GPU micro-architecture developed by NVIDIA, first introduced at retail in April 2012 as the successor to the Fermi micro-architecture. The K40 uses a single GK180 chip and was launched in October 2013.
[1] NVIDIA retired the Tesla brand in May 2020 because of potential confusion with the Tesla brand of cars. Its newer GPUs are branded NVIDIA Data Center GPUs.

2 V100
The chip in the Tesla V100 is the GV100, which uses NVIDIA's Volta micro-architecture. Volta is the codename for a GPU micro-architecture developed by NVIDIA, succeeding Pascal. The first graphics card to use it was the data center Tesla V100, which contains a single GV100 chip and was released in June 2017.

3 Additional features in V100


In the full GV100 chip there are six graphics processing clusters (GPCs), each containing 14 streaming multiprocessors (in the K40, the GPU is not divided into processing clusters). Each SM contains four processing blocks, each with its own L0 instruction cache. V100 is the first GPU to contain tensor cores: there are two tensor cores per processing block, i.e., 2 × 4 = 8 per SM. Each tensor core performs 64 floating-point fused multiply-add (FMA) operations per clock, computing a multiply-accumulate of 4 × 4 matrices (D = A × B + C, where A, B, C, and D are 4 × 4 matrices).
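As an illustration (our sketch, not from the source), the CUDA kernel below drives the tensor cores through the WMMA API in mma.h. Note that WMMA exposes the operation to a warp as a 16 × 16 × 16 tile; the 4 × 4 multiply-accumulate described above is the hardware step underneath. It requires compute capability 7.0 (compile with nvcc -arch=sm_70).

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes D = A * B + C for a single 16x16 tile.
    // A and B are FP16; C and D are FP32 (the V100 mixed-precision path).
    __global__ void wmma_tile(const half *a, const half *b,
                              const float *c, float *d) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        wmma::load_matrix_sync(a_frag, a, 16);                        // leading dimension 16
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);

        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);           // tensor core MMA
        wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
    }

    // Launched with a single warp: wmma_tile<<<1, 32>>>(a, b, c, d);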
Another difference between V100 and K40 at the SM level is in the L1 cache. In the K40 there are two memory units at this level, namely a combined shared memory/L1 cache and a separate read-only data cache, whereas V100 has a single combined shared memory and L1 data cache. Shared memory provides high bandwidth, low latency, and consistent performance (no cache misses), but the CUDA programmer needs to manage this memory explicitly. Volta narrows the gap between applications that explicitly manage shared memory and those that access data in device memory directly: by combining the two units, L1 cache operations attain the benefits of shared memory. The combined capacity is 128 KB per SM, and all of it is usable as a cache by programs that do not use shared memory. While shared memory remains the best choice for maximum performance, the new Volta L1 design enables programmers to get excellent performance quickly, with less programming effort. The programmer has control over the split on both GPUs, but V100 offers more freedom: in the K40, the 64 KB of unified L1 cache and shared memory allows only three configurations (16, 32, or 48 KB of shared memory), whereas in V100 the shared memory portion of the 128 KB can be configured up to 96 KB. In addition, prior NVIDIA GPUs only performed load caching, while GV100 introduces write caching (caching of store operations) to further improve performance. A sketch of configuring the split on both GPUs follows.
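In the sketch below (ours, not from the source; my_kernel is a placeholder kernel), the Kepler-era cache-config call selects one of the K40's fixed splits, while the Volta carveout attribute hints what fraction of the 128 KB array should be dedicated to shared memory.

    #include <cuda_runtime.h>

    __global__ void my_kernel() { /* uses shared memory */ }

    void configure_split() {
        // K40 (Kepler): choose one of the fixed splits of the 64 KB array,
        // here 48 KB shared memory / 16 KB L1 cache.
        cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);

        // V100 (Volta): hint the preferred shared-memory carveout as a
        // percentage of the 128 KB array; 75% corresponds to 96 KB.
        cudaFuncSetAttribute(my_kernel,
                             cudaFuncAttributePreferredSharedMemoryCarveout, 75);
    }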
Unlike the GK180, which could not execute FP32 and INT32 instructions simultaneously, the Volta GV100 SM includes separate FP32 and INT32 cores, allowing simultaneous execution of FP32 and INT32 operations at full throughput. Many applications have inner loops that combine pointer arithmetic (integer memory address calculations) with floating-point computations, and these benefit from simultaneous execution of FP32 and INT32 instructions: each iteration of a pipelined loop can update addresses (INT32 pointer arithmetic) and load data for the next iteration while simultaneously processing the current iteration in FP32.
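An illustrative kernel (our example) where this pattern appears: the index updates are INT32 work and the fused multiply-add is FP32 work, so on V100 the two can issue concurrently.

    // Grid-stride SAXPY: INT32 index arithmetic and an FP32 FMA in every iteration.
    __global__ void saxpy(const float *x, float *y, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // INT32 address math
        int stride = gridDim.x * blockDim.x;
        for (; i < n; i += stride) {                    // INT32 update per iteration
            y[i] = a * x[i] + y[i];                     // FP32 fused multiply-add
        }
    }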
V100 uses the NVLink interconnect. NVLink is NVIDIA's high-speed interconnect technology, first introduced in 2016 with the Tesla P100 accelerator and the Pascal GP100 GPU. NVLink provides significantly more performance for both GPU-to-GPU and GPU-to-CPU system configurations than PCIe interconnects.
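The CUDA peer-access API (a minimal sketch of ours below) is how a program requests direct GPU-to-GPU transfers; the API is interconnect-agnostic, so the same calls run over NVLink when it connects the two GPUs, as with paired V100s.

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int can01 = 0, can10 = 0;
        cudaDeviceCanAccessPeer(&can01, 0, 1);   // can device 0 access device 1?
        cudaDeviceCanAccessPeer(&can10, 1, 0);
        if (can01 && can10) {
            cudaSetDevice(0);
            cudaDeviceEnablePeerAccess(1, 0);    // flags argument must be 0
            cudaSetDevice(1);
            cudaDeviceEnablePeerAccess(0, 0);
            printf("direct GPU-to-GPU access enabled between devices 0 and 1\n");
        }
        return 0;
    }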
The memory interface used by V100 is HBM2, while K40 uses GDDR5. HBM2 memory is composed of memory stacks located on the same physical package as the GPU, providing substantial power and area savings compared to traditional GDDR5 memory designs and thus permitting more GPUs to be installed in servers.
Both the V100 HBM2 memory subsystem and the K40 GDDR5 memory subsystem support Single-Error Correcting Double-Error Detecting (SECDED) Error Correction Code (ECC) to protect data. ECC provides higher reliability for compute applications that are sensitive to data corruption. On the K40, some additional bandwidth is consumed when ECC is enabled; on the V100, ECC can be active without a bandwidth penalty.
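Whether ECC is currently active on a device can be checked at runtime (a small sketch of ours) through the ECCEnabled field of cudaDeviceProp; the state itself is typically toggled offline with the nvidia-smi tool.

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0
        printf("%s: ECC %s\n", prop.name,
               prop.ECCEnabled ? "enabled" : "disabled");
        return 0;
    }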
