MLPerf Training Scores: Microsoft Demonstrates Fastest Cloud AI


By Sally Ward-Foxton < https://www.eetimes.com/author/sally-ward-foxton/> 12.01.2021

In the latest round of MLPerf AI training benchmarks, Microsoft Azure demonstrated the world’s fastest cloud for
AI using large-scale Nvidia-powered instances. Azure’s NDm A100 v4 series of virtual machines ran
benchmarks on up to 2,048 Nvidia A100-80GB GPUs, completing each benchmark in under 18 minutes.

Nvidia led on seven of the eight benchmarked workloads in the closed division with systems containing up to
4,320 A100 accelerators. Microsoft Azure topped the eighth category (medical imaging) with its Nvidia-powered
cloud instance. Graphcore and Habana Labs also submitted improved results for ResNet-50 and BERT
benchmarks.

Microsoft Azure

Microsoft Azure’s MLPerf submission is ranked tenth among the world’s top 100 supercomputers. Nvidia’s in-
house AI supercomputer, Selene, is about twice the size and currently ranks sixth.

Azure’s NDm A100 v4 series of virtual machines offers scalability from 1 to more than 256 virtual machines, or
from 8 to 2,048 GPUs, as required. The 2,048 GPUs used in the Azure cloud demonstrated the ability to train
an entire BERT natural language processing model in just over 25 seconds. The most difficult benchmark,
MiniGo, was trained in under 17.5 minutes using 1,792 GPUs. Azure topped the 3D Unet benchmark, used for
three-dimensional medical images, with a training time of 1.262 minutes using 768 GPUs (Nvidia’s 768-GPU
result for 3D Unet was 1.373 minutes).

Among Microsoft’s goals was demonstrating that Azure cloud performance is comparable to on-premises
equipment.

Nvidia

Nvidia’s submissions were designed to demonstrate the company’s capabilities for large-scale AI training.

“Scaling to larger clusters is really the hardest part of training AI, and it’s one where Nvidia’s AI platform has
tremendous strengths,” claimed Paresh Kharya, Nvidia’s senior director of product management for accelerated
computing. “Scaling is really important because everything becomes a bottleneck. It’s a very hard problem.
From distributing work, coordinating work to moving data, everything becomes a bottleneck.”

Training huge, cutting-edge models can take months, even on Selene, Kharya said, adding that advancing
state-of-the-art AI models would be impossible without scaling.

Scale is also important, Kharya said, since the ability to iterate fast on AI projects is vital. “One of the common
misperceptions we see is to use just the cost of the infrastructure for the [return on investment] for training
models,” he added. Users “care about the cost of infrastructure, but also the productivity of their expensive data
science teams, and ultimately the time to bring their products and updates to their products to market faster
than the competition.”

Benchmarks run on Selene scaled up to 4,320 GPUs, the largest system in this round. Nvidia said the results
were 30 times faster than those of the fastest Graphcore system (256 accelerators) and 53 times faster than
those of Habana Labs’ biggest system (also 256 accelerators).

< https://www.eetimes.com/wp-content/uploads/Nvidia-graph1-1.jpg>
Time to train for all benchmarks; smaller is better. The results compare systems with
different numbers of accelerators. Google TPU v4 results from the previous round of
MLPerf scores are shown for comparison. (Source: Nvidia)

< https://www.eetimes.com/wp-content/uploads/Nvidia-graph2-1.jpg>
Performance normalized to per-accelerator chip; higher is better. Normalized to
performance of Nvidia A100. Google TPU v4 results from the previous round of MLPerf
scores are shown for comparison. (Source: Nvidia)
As for per-accelerator chip performance, Nvidia claimed victory over Graphcore and Habana Labs accelerators,
though it trailed Google TPU v4’s ResNet-50 score from the previous round of training benchmarks.

Nvidia also noted its steadily improving scores. Compared to MLPerf Training scores from July 2020 <
https://www.eetimes.com/nvidia-google-both-claim-mlperf-training-crown/> (when the A100 was
introduced), Nvidia A100-based systems performed five times faster at scale and twice as fast at the chip
level.

Software changes account for the performance gains, including CUDA Graphs, which reduce CPU bottlenecks by
launching an entire sequence of kernels as a single unit rather than one at a time; the full training iteration thus
runs directly on the GPUs without per-kernel launch overhead. CUDA streams improved parallelism by
introducing a fine-grained overlap of computation and communication.
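As a rough illustration of the technique, rather than Nvidia’s actual MLPerf submission code, the sketch below
uses PyTorch’s CUDA graph API (an assumption about tooling) to capture one training iteration and replay it, so
the CPU issues a single launch per step instead of one launch per kernel.

    # Illustrative sketch only, assuming PyTorch on a CUDA device -- not Nvidia's
    # MLPerf submission code. A full training iteration is captured as one CUDA
    # graph and replayed, so the CPU launches the whole kernel sequence at once.
    import torch
    import torch.nn.functional as F

    device = torch.device("cuda")
    model = torch.nn.Linear(1024, 1024).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    # Graph capture requires static input/output buffers that are reused each step.
    static_x = torch.randn(64, 1024, device=device)
    static_y = torch.randn(64, 1024, device=device)

    # Warm up on a side stream before capture (recommended by the PyTorch docs).
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(3):
            opt.zero_grad(set_to_none=True)
            F.mse_loss(model(static_x), static_y).backward()
            opt.step()
    torch.cuda.current_stream().wait_stream(side)

    # Capture forward, backward and optimizer step as a single graph.
    graph = torch.cuda.CUDAGraph()
    opt.zero_grad(set_to_none=True)
    with torch.cuda.graph(graph):
        static_loss = F.mse_loss(model(static_x), static_y)
        static_loss.backward()
        opt.step()

    # Training loop: copy new data into the static buffers, then replay the graph.
    for _ in range(10):
        static_x.copy_(torch.randn(64, 1024, device=device))
        static_y.copy_(torch.randn(64, 1024, device=device))
        graph.replay()

The same framework-level idea extends to overlapping communication with computation on separate CUDA
streams, as described above.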

Nvidia’s NCCL and SHARP technologies < https://docs.mellanox.com/display/sharpv243/Using+NVIDIA+SHARP+with+NVIDIA+NCCL>
were used to improve multi-GPU and multi-node processing. NCCL optimizes data aggregation based on
available bandwidth and network latency. SHARP improves performance by offloading operations from the CPU
onto the switch, eliminating the need to send data multiple times between different endpoints and servers.
Meanwhile, an updated MXNet implementation improved the efficiency of memory copies for operations like
concatenation and split.
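For context, the sketch below shows the minimal application-side pattern for an NCCL-backed gradient
all-reduce via torch.distributed; it is a generic illustration assuming a PyTorch job launched with torchrun, not
Nvidia’s benchmark code. Optimizations such as SHARP offload happen inside NCCL and the network fabric,
beneath this API.

    # Minimal sketch of an NCCL all-reduce via torch.distributed -- a generic
    # illustration, not Nvidia's MLPerf code. Launch with, for example:
    #   torchrun --nproc_per_node=8 allreduce_sketch.py   (hypothetical filename)
    import os
    import torch
    import torch.distributed as dist

    def main() -> None:
        # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for each worker process.
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

        # Each GPU holds a gradient-like tensor; NCCL chooses how to reduce it
        # (ring, tree, or switch-assisted) based on topology, size and bandwidth.
        grad = torch.full((1 << 20,), float(dist.get_rank()), device="cuda")
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()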

Graphcore

Graphcore demonstrated scaling on larger systems, including those with 128 and 256 IPU accelerators.

For 16- and 64-accelerator systems, Graphcore’s ResNet-50 scores improved 24 percent on the IPU-Pod16
and 41 percent on the IPU-Pod64. For BERT, IPU-Pod16 scores improved 5 percent and IPU-Pod64 scores
rose 12 percent. Again, software optimization helped boost performance.

Graphcore’s results compare its IPU-Pod16 performance to Nvidia’s DGX-A100, even though the Graphcore
platform includes twice the number of accelerator chips. Graphcore maintained the systems are equivalent in
size (the IPU-Pod16 is 5U versus the DGX-A100 in 6U) and roughly equivalent on power consumption and
price. It should be noted that Graphcore is the only company to use this comparison <
https://www.eetimes.com/graphcore-challenges-nvidia-with-in-house-benchmarks/> . Graphcore claimed
its IPU-Pod16 outperformed Nvidia’s DGX-A100 on ResNet-50 (28.3 minutes to train on Graphcore; 29.1
minutes to train on Nvidia).

Graphcore’s BERT scores reflect systems with fewer host CPUs per accelerator than ResNet-50. BERT scores
were benchmarked on systems with one host CPU per 32 IPUs, while ResNet-50 scores were benchmarked on
systems with one host CPU per 8 IPUs.

“We have the flexibility to vary this property per workload, which is unusual,” said Dave Lacey, Graphcore’s
chief software architect. “That enables us to experiment… and get these points of efficiency.”

Lacey added that this approach allows users to perform more computing on a single host server without moving
to distributed CPU computation that requires additional infrastructure.

“This is also an important factor of cost,” Lacey said. “All these systems have very hefty CPUs on them, and
that’s a significant cost to your system. If you can get away with the best ratio, the smallest number of CPUs,
the accelerators are really doing the very heavy lifting here. Then that cost optimizes best for that particular
workload.”

< https://www.eetimes.com/wp-content/uploads/Graphcore-host-to-accelerator-graph.jpg>
Graphcore’s accelerators require fewer host CPUs per accelerator for BERT training.
(Source: Graphcore) (click to enlarge)

Lacey said Graphcore made a deliberate design choice for its IPU to push application logic onto the accelerator.
The connection between host and accelerator is only used for training data – no code, no heavy
synchronization, just data, he added.

The number of host CPUs can also be reduced, depending on the workload and the data it uses. “It
[depends on] how much preparation or other non-AI type tasks are being done on the CPU, and also
how much is traveling between the CPU and the accelerator,” Lacey said.

The effect is particularly pronounced for BERT workloads, where the input data is much smaller than the images
required for other workloads. Image-processing workloads like ResNet-50 require additional non-AI tasks such as
image decompression, which are better suited to the host CPU. Hence, more hosts are required.

Ethernet connections between host and accelerator also provide flexibility to reconfigure the number of host
CPUs accordingly.

Graphcore’s comparisons of the ratio between host CPUs and accelerators are based on one Graphcore chip
to one Nvidia or Habana chip. If a single Graphcore IPU-Pod16 is treated as equivalent to a single Nvidia
DGX-A100, as Graphcore argued for its ResNet-50 time-to-train comparison, ResNet-50 training would require
the same number of host CPUs (any advantage applies to BERT only in this example).

Intel Habana Labs

Intel’s Habana Labs submitted its second round of MLPerf training scores using its Gaudi training accelerator
chip. Since the last round, Gaudi’s performance has doubled for BERT. ResNet-50 scores also improved by 11
percent.

Habana also demonstrated the scalability of its Gaudi technology, presenting similar results for naïve and weak
scaling (weak scaling is not covered in MLPerf results).

Itay Hubara, Habana’s senior researcher, said naïve scaling considers the time to train for systems at different
scales. Weak scaling is derived from naïve scaling results. Increasing the number of accelerators typically
entails increasing batch size (the number of training data samples simultaneously fed into the system) in order
to keep the hardware fully utilized. But increased batch size usually requires more training iterations since
weights are updated after processing more data samples. That means more training data are required to
achieve the same result in larger systems. Weak scaling is the naïve scaling score normalized per throughput,
or to the same amount of data being processed.
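One way to read that normalization, sketched below with purely hypothetical numbers (not Habana’s exact
methodology): rescale each measured, naïve time-to-train to the amount of data a reference run processed, so a
system that needed a larger batch and more samples to converge is compared on equal footing.

    # Illustrative sketch with hypothetical numbers -- one reading of the
    # normalization described above, not Habana's exact methodology.
    def weak_scaling_minutes(naive_minutes: float,
                             samples_processed: int,
                             reference_samples: int) -> float:
        """Rescale a measured time-to-train to a reference amount of training data."""
        return naive_minutes * reference_samples / samples_processed

    # Hypothetical 64-chip run: its larger global batch meant it had to process
    # 25 percent more samples than the 8-chip baseline to reach the same accuracy.
    print(weak_scaling_minutes(naive_minutes=10.0,
                               samples_processed=5_000_000,
                               reference_samples=4_000_000))   # -> 8.0 minutes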

< https://www.eetimes.com/wp-content/uploads/Habana-Scaling-graphs.jpg>
Habana’s naïve scaling results per MLPerf (left graph) versus weak scaling results (right
graph) showing similar results. (Source: Habana Labs)

“Our weak scaling and naïve scaling figures are very close for up to 64 Gaudi chips because we didn’t have to
increase the batch size. We can work with a small local batch size,” Hubara said. “When [switching] to 16
[accelerators from eight], I don’t have to increase the global batch size by 2x… The architecture of Gaudi
enables us to get high utilization even if I don’t take the maximum batch size that I can put into the device.”

Habana’s scores have improved over the last round, once again as a result of software optimizations.

BERT training times were halved thanks to data-packing techniques, where shorter sentences in the training
data were packed together into one multi-sequence. (Shorter sentences would otherwise be padded with zeros
to achieve a fixed input size.) Data packing is handled in pre-processing, and is not part of the benchmarked
training time.
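A minimal sketch of the general packing idea follows, assuming tokenized sequence lengths and a fixed
512-token input (hypothetical values); it is not Habana’s pre-processing code. In a real pipeline the packed rows
also need per-sequence attention masking so that packed sequences do not attend to one another.

    # Illustrative sketch of sequence packing for BERT-style inputs -- not Habana's
    # pre-processing code. Greedy first-fit: place each sequence into the first
    # row that still has room, instead of padding every sequence to max_len.
    from typing import List

    def pack_sequences(lengths: List[int], max_len: int = 512) -> List[List[int]]:
        """Return groups of sequence lengths, each group fitting in max_len tokens."""
        rows: List[List[int]] = []
        for n in sorted(lengths, reverse=True):
            for row in rows:
                if sum(row) + n <= max_len:
                    row.append(n)
                    break
            else:
                rows.append([n])
        return rows

    # Hypothetical lengths: 7 sequences collapse into 3 packed rows instead of
    # 7 zero-padded rows, so each training step does less wasted work.
    print(pack_sequences([512, 300, 200, 128, 90, 60, 40]))
    # [[512], [300, 200], [128, 90, 60, 40]]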

Habana also implemented light checkpoint saving, since the time required to save checkpoints becomes
significant at scale. Rather than saving a full checkpoint, each worker saves a subset of the model weights,
boosting speed.
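The sketch below illustrates the general idea under the assumption of a PyTorch data-parallel job with an
initialized process group; it is not Habana’s implementation. Each rank writes only its slice of the state dict, so no
single worker serializes the full model.

    # Illustrative sketch of distributed ("light") checkpoint saving -- not
    # Habana's implementation. Each rank persists a round-robin slice of the
    # model weights; restoring would merge the per-rank shard files.
    import torch
    import torch.distributed as dist

    def save_light_checkpoint(model: torch.nn.Module, prefix: str) -> None:
        rank, world = dist.get_rank(), dist.get_world_size()
        shard = {name: tensor.detach().cpu()
                 for i, (name, tensor) in enumerate(model.state_dict().items())
                 if i % world == rank}
        # Hypothetical naming scheme: one shard file per rank.
        torch.save(shard, f"{prefix}.rank{rank}.pt")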

Asked whether Habana accelerators could operate with fewer host CPUs, Hubara said: “The ratio of host CPUs
to Gaudi cards can be changed; it is not a limit of our Gaudi card. Yet, a typical system has two Xeon sockets
for eight accelerators. We use this configuration since we aim to replace GPU-based systems, and our
customers prefer dual-socket systems.”

Google

Google did not submit MLPerf training scores into the closed division, but did submit two scores in the open
division for a pair of very large models, both architecturally similar to MLPerf’s BERT model but with larger
dimensions and more layers.

One score trained a 480-billion-parameter, Transformer-based, encoder-only benchmark using TensorFlow
running on a 2,048-accelerator TPUv4 system, training in approximately 55 hours.

The other score trained a 200-billion-parameter JAX model on a 1,024-chip TPUv4 system, training in
approximately 40 hours.

Google said that each training run achieved a computational efficiency of 63 percent.
The full list of MLPerf AI Training benchmark scores is here < https://mlcommons.org/en/training-normal-11/> .

Sally Ward-Foxton
Sally Ward-Foxton covers AI technology and related issues for EETimes.com and all
aspects of the European industry for EETimes Europe magazine. Sally has spent more than
15 years writing about the electronics industry from London, UK. She has written for
Electronic Design, ECN, Electronic Specifier: Design, Components in Electronics, and many
more. She holds a master’s degree in Electrical and Electronic Engineering from the
University of Cambridge.
