
A Full Hardware Guide to Deep Learning

2018-12-16 by Tim Dettmers

Deep Learning is very computationally intensive, so you will need a fast CPU with many cores, right? Or is it maybe wasteful to buy a fast CPU? One of the worst things you can do when building a deep learning system is to waste money on hardware that is unnecessary. Here I will guide you step by step through the hardware you will need for a cheap high-performance system.

Over the years, I have built a total of 7 different deep learning workstations, and despite careful research and reasoning, I made my fair share of mistakes in selecting hardware parts. In this guide, I want to share the experience that I have gained over the years so that you do not make the same mistakes that I did.

The blog post is ordered by mistake severity. This means the mistakes
where people usually waste the most money come first.

GPU
This blog post assumes that you will use a GPU for deep learning. If you
are building or upgrading your system for deep learning, it is not sensible
to leave out the GPU. The GPU is the heart of deep learning applications – the improvement in processing speed is simply too huge to ignore.

I talked at length about GPU choice in my GPU recommendations blog post, and the choice of your GPU is probably the most critical choice for your deep learning system. There are three main mistakes that you can make when choosing a GPU: (1) bad cost/performance, (2) not enough memory, (3) poor cooling.

For good cost/performance, I generally recommend an RTX 2070 or an RTX 2080 Ti. If you use these cards you should use 16-bit models. Otherwise, GTX 1070, GTX 1080, GTX 1070 Ti, and GTX 1080 Ti from eBay are fair choices and you can use these GPUs with 32-bit (but not 16-bit).
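
To make the 16-bit point concrete, here is a minimal sketch of my own (assuming PyTorch and a CUDA GPU) that stores the same mini-batch in 32-bit and 16-bit and prints the memory footprint. Note that full 16-bit training additionally needs loss scaling, which libraries such as NVIDIA's apex handle for you.

```python
import torch

# Minimal sketch (assumes PyTorch and a CUDA GPU): the same mini-batch
# stored in 16-bit needs half the GPU memory of the 32-bit version.
x32 = torch.randn(32, 3, 225, 225, device="cuda")   # float32 ImageNet-sized batch
x16 = x32.half()                                     # float16 copy

mib = lambda t: t.element_size() * t.nelement() / 2**20
print(f"32-bit batch: {mib(x32):.1f} MiB")           # ~18.5 MiB
print(f"16-bit batch: {mib(x16):.1f} MiB")           # ~9.3 MiB
```
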
Be careful about the memory requirements when you pick your GPU. RTX cards, which can run in 16-bit, can train models that are twice as big with the same memory compared to GTX cards. As such, RTX cards have a memory advantage, and picking RTX cards and learning how to use 16-bit models effectively will carry you a long way. In general, the requirements for memory are roughly the following:

• Research that is hunting state-of-the-art scores: >=11 GB
• Research that is hunting for interesting architectures: >=8 GB
• Any other research: 8 GB
• Kaggle: 4 – 8 GB
• Startups: 8 GB (but check the specific application area for model sizes)
• Companies: 8 GB for prototyping, >=11 GB for training

Another problem to watch out for, especially if you buy multiple RTX cards, is cooling. If you want to stick GPUs into PCIe slots which are next to each other, you should make sure that you get GPUs with a blower-style fan. Otherwise you might run into temperature issues, and your GPUs will be slower (about 30%) and die faster.

[Image: Suspect line-up. Can you identify the hardware part which is at fault for bad performance? One of these GPUs? Or maybe it is the fault of the CPU after all?]
RAM
The main mistake with RAM is to buy RAM with too high a clock rate. The second mistake is to buy too little RAM to have a smooth prototyping experience.

Needed RAM Clock Rate
RAM clock rates are marketing stunts where RAM companies lure you into buying “faster” RAM which actually yields little to no performance gains. This is best explained by the “Does RAM speed REALLY matter?” video on RAM by Linus Tech Tips.

Furthermore, it is important to know that RAM speed is pretty much irrelevant for fast CPU RAM->GPU RAM transfers. This is so because (1) if you use pinned memory, your mini-batches will be transferred to the GPU without involvement from the CPU, and (2) if you do not use pinned memory, the performance gain of fast vs slow RAM is about 0-3% — spend your money elsewhere!
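
To make the pinned-memory point concrete, here is a minimal sketch of my own (assuming PyTorch): `pin_memory=True` places batches in page-locked RAM, and `non_blocking=True` lets the CPU->GPU copy overlap with GPU compute.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Minimal sketch (assumes PyTorch): pin_memory=True places batches in
# page-locked RAM; non_blocking=True lets the CPU->GPU copy overlap
# with GPU compute instead of blocking on the CPU.
data = TensorDataset(torch.randn(1024, 3, 225, 225),
                     torch.randint(0, 10, (1024,)))
loader = DataLoader(data, batch_size=32, num_workers=2, pin_memory=True)

for x, y in loader:
    x = x.to("cuda", non_blocking=True)
    y = y.to("cuda", non_blocking=True)
    # ... forward/backward pass here ...
```
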

RAM Size
RAM size does not affect deep learning performance. However, it might hinder you from executing your GPU code comfortably (without swapping to disk). You should have enough RAM to work comfortably with your GPU. This means you should have at least the amount of RAM that matches your biggest GPU. For example, if you have a Titan RTX with 24 GB of memory, you should have at least 24 GB of RAM. However, if you have more GPUs you do not necessarily need more RAM.

The problem with this “match largest GPU memory in RAM” strategy is
that you might still fall short of RAM if you are processing large datasets.
The best strategy here is to match your GPU memory, and if you feel that you do not have enough RAM, just buy some more.

A different strategy is influenced by psychology: psychology tells us that concentration is a resource that is depleted over time. RAM is one of the few hardware pieces that allows you to conserve your concentration resource for more difficult programming problems. Rather than spending lots of time on circumnavigating RAM bottlenecks, you can invest your concentration in more pressing matters if you have more RAM. With a lot of RAM you can avoid those bottlenecks, save time, and increase productivity on more pressing problems. Especially in Kaggle competitions, I found additional RAM very useful for feature engineering. So if you have the money and do a lot of pre-processing, then additional RAM might be a good choice. With this strategy, you want to have more, cheap RAM now rather than later.

CPU
The main mistake that people make is that they pay too much attention to the PCIe lanes of a CPU. You should not care much about PCIe lanes. Instead, just look up whether your CPU and motherboard combination supports the number of GPUs that you want to run. The second most common mistake is to get a CPU which is too powerful.

CPU and PCI-Express
People go crazy about PCIe lanes! However, the thing is that they have almost no effect on deep learning performance. If you have a single GPU, PCIe lanes are only needed to transfer data from your CPU RAM to your GPU RAM quickly. However, an ImageNet batch of 32 images (32×225×225×3) in 32-bit needs 1.1 milliseconds with 16 lanes, 2.3 milliseconds with 8 lanes, and 4.5 milliseconds with 4 lanes. These are theoretical numbers, and in practice you often see PCIe be twice as slow — but this is still lightning fast! PCIe lanes often have a latency in the nanosecond range, and thus latency can be ignored.
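
As a quick sanity check of those numbers, here is a back-of-the-envelope sketch of my own (assuming roughly 985 MB/s of usable bandwidth per PCIe 3.0 lane; real systems are often slower):

```python
# Rough sanity check of the transfer times above (assumes ~985 MB/s of
# usable bandwidth per PCIe 3.0 lane; real systems are often slower).
batch_bytes = 32 * 225 * 225 * 3 * 4           # float32 ImageNet batch, ~19.4 MB
for lanes in (16, 8, 4):
    bandwidth = lanes * 985e6                  # bytes per second
    print(f"{lanes:>2} lanes: {1000 * batch_bytes / bandwidth:.1f} ms")
# -> ~1.2 ms, ~2.5 ms, ~4.9 ms, in the same ballpark as 1.1/2.3/4.5 ms
```
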

Putting this together, we have for an ImageNet mini-batch of 32 images and a ResNet-152 the following timing:

• Forward and backward pass: 216 milliseconds (ms)
• 16 PCIe lanes CPU->GPU transfer: About 2 ms (1.1 ms theoretical)
• 8 PCIe lanes CPU->GPU transfer: About 5 ms (2.3 ms)
• 4 PCIe lanes CPU->GPU transfer: About 9 ms (4.5 ms)

Thus going from 4 to 16 PCIe lanes will give you a performance increase
of roughly 3.2%. However, if you use PyTorch’s data loader with pinned
memory you gain exactly 0% performance. So do not waste your money
on PCIe lanes if you are using a single GPU!
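The 3.2% figure falls straight out of these timings; here is the quick arithmetic:

```python
# Where the ~3.2% comes from: compare total step time with 16 vs 4 lanes.
step_16 = 216 + 2    # ms: forward/backward pass + transfer with 16 lanes
step_4  = 216 + 9    # ms: the same step with only 4 lanes
print(f"speedup from 4 -> 16 lanes: {step_4 / step_16 - 1:.1%}")   # ~3.2%
```
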

When you select CPU PCIe lanes and motherboard PCIe lanes, make sure that you select a combination which supports the desired number of GPUs. If you buy a motherboard that supports 2 GPUs, and you want to have 2 GPUs eventually, make sure that you buy a CPU that supports 2 GPUs, but do not necessarily look at PCIe lanes.
PCIe Lanes and Multi-GPU Parallelism
Are PCIe lanes important if you train networks on multiple GPUs with data parallelism? I have published a paper on this at ICLR 2016, and I can tell you that if you have 96 GPUs then PCIe lanes are really important. However, if you have 4 or fewer GPUs this does not matter much. If you parallelize across 2-3 GPUs, I would not care at all about PCIe lanes. With 4 GPUs, I would make sure that I can get 8 PCIe lanes per GPU (32 PCIe lanes in total). Since almost nobody runs a system with more than 4 GPUs, as a rule of thumb: do not spend extra money to get more PCIe lanes per GPU — it does not matter!

Needed CPU Cores
To be able to make a wise choice for the CPU we first need to understand the CPU and how it relates to deep learning. What does the CPU do for deep learning? The CPU does little computation when you run your deep nets on a GPU. Mostly it (1) initiates GPU function calls, and (2) executes CPU functions.

By far the most useful application for your CPU is data preprocessing.
There are two different common data processing strategies which have
different CPU needs.

The first strategy is preprocessing while you train:

Loop:

1. Load mini-batch
2. Preprocess mini-batch
3. Train on mini-batch

The second strategy is preprocessing before any training:

1. Preprocess data
2. Loop:
1. Load preprocessed mini-batch
2. Train on mini-batch

For the first strategy, a good CPU with many cores can boost
performance significantly. For the second strategy, you do not need a
very good CPU. For the first strategy, I recommend a minimum of 4
threads per GPU — that is usually two cores per GPU. I have not done
hard tests for this, but you should gain about 0-5% additional
performance per additional core/GPU.

For the second strategy, I recommend a minimum of 2 threads per GPU — that is usually one core per GPU. You will not see significant gains in performance when you have more cores if you are using the second strategy.
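
In PyTorch terms, the difference between the two strategies mostly comes down to how much work the loader does per batch and how many worker processes feed it. Here is a sketch of my own; `RawImageDataset` is a hypothetical stand-in for on-the-fly decoding and augmentation:

```python
import torch
from torch.utils.data import DataLoader, Dataset, TensorDataset

# Hypothetical stand-in for strategy 1: each __getitem__ call decodes
# and augments an image on the fly, so it burns CPU time per sample.
class RawImageDataset(Dataset):
    def __len__(self):
        return 1024
    def __getitem__(self, i):
        x = torch.rand(3, 225, 225)      # pretend: decode JPEG + augment
        return x, i % 10

# Strategy 1: preprocess while training -> several workers per GPU.
live_loader = DataLoader(RawImageDataset(), batch_size=32,
                         num_workers=4, pin_memory=True)

# Strategy 2: data was preprocessed once and saved as tensors ->
# loading is cheap and one worker per GPU is usually enough.
cached = TensorDataset(torch.rand(1024, 3, 225, 225),
                       torch.randint(0, 10, (1024,)))
cached_loader = DataLoader(cached, batch_size=32,
                           num_workers=1, pin_memory=True)
```
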

Needed CPU Clock Rate (Frequency)
When people think about fast CPUs they usually first think about the clock rate. 4 GHz is better than 3.5 GHz, or is it? This is generally true for comparing processors with the same architecture, e.g. “Ivy Bridge”, but it does not compare well between processors of different architectures. Also, it is not always the best measure of performance.

In the case of deep learning there is very little computation to be done by the CPU: increment a few variables here, evaluate some Boolean expression there, make some function calls on the GPU or within the program – all these depend on the CPU core clock rate.

While this reasoning seems sensible, the CPU is nonetheless at 100% usage when I run deep learning programs, so what is the issue here? I did some CPU core clock rate underclocking experiments to find out.
[Figure: CPU underclocking on MNIST and ImageNet. Performance is measured as time taken on 200 epochs of MNIST or a quarter epoch on ImageNet with different CPU core clock rates, where the maximum clock rate is taken as the baseline for each CPU. For comparison: upgrading from a GTX 680 to a GTX Titan is about +15% performance; from GTX Titan to GTX 980 another +20%; GPU overclocking yields about +5% performance for any GPU.]
Note that these experiments were run on dated hardware; however, the results should still hold for modern CPUs/GPUs.

Hard drive/SSD
The hard drive is not usually a bottleneck for deep learning. However, if you do stupid things it will hurt you: if you read your data from disk when it is needed (blocking wait), then a 100 MB/s hard drive will cost you about 185 milliseconds for an ImageNet mini-batch of size 32 — ouch! However, if you asynchronously fetch the data before it is used (for example with torchvision loaders), then you will have loaded the mini-batch in those 185 milliseconds while the compute time for most deep neural networks on ImageNet is about 200 milliseconds. Thus you will not face any performance penalty, since you load the next mini-batch while the current one is still computing.
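
The overlap trick is easy to sketch with a background thread. This is a toy example of my own; the 185 ms load and 200 ms compute from above are simulated with sleeps:

```python
import queue, threading, time

def prefetch(load_batch, n_batches, buffer_size=2):
    """Run load_batch in a background thread so disk I/O overlaps compute."""
    q = queue.Queue(maxsize=buffer_size)
    def worker():
        for i in range(n_batches):
            q.put(load_batch(i))   # blocks when the buffer is full
        q.put(None)                # sentinel: no more batches
    threading.Thread(target=worker, daemon=True).start()
    while (batch := q.get()) is not None:
        yield batch

load = lambda i: time.sleep(0.185) or i     # ~185 ms simulated disk read
start = time.time()
for batch in prefetch(load, 10):
    time.sleep(0.200)                       # ~200 ms simulated GPU compute
print(f"{time.time() - start:.2f}s")        # ~2.2s instead of ~3.9s: I/O hidden
```
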

However, I recommend an SSD for comfort and productivity: programs start and respond more quickly, and pre-processing with large files is quite a bit faster. If you buy an NVMe SSD you will have an even smoother experience when compared to a regular SSD. Thus the ideal setup is to have a large and slow hard drive for datasets and an SSD for productivity and comfort.

Power supply unit (PSU)
Generally, you want a PSU that is sufficient to accommodate all your future GPUs. GPUs typically get more energy efficient over time; so while other components will need to be replaced, a PSU should last a long while, and a good PSU is therefore a good investment.

You can calculate the required watts by adding up the watts of your CPU and GPUs plus an additional 10% for other components and as a buffer for power spikes. For example, if you have 4 GPUs, each with 250 watts TDP, and a CPU with 150 watts TDP, then you will need a PSU with a minimum of 4×250 + 150 + 100 = 1250 watts. I would usually add another 10% just to be sure everything works out, which in this case would result in a total of 1375 watts. I would round up in this case and get a 1400 watt PSU.
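
The same rule of thumb as a tiny helper (my own sketch; note that the calculation above rounds the first 10% buffer down to a flat 100 W, which is why it lands on 1375 W rather than ~1392 W — either way, you end up buying a 1400 W PSU):

```python
def required_psu_watts(gpu_tdps, cpu_tdp):
    """Rule of thumb: components + ~10% buffer, then another ~10% margin."""
    components = sum(gpu_tdps) + cpu_tdp
    with_buffer = components * 1.10      # other parts + power spikes
    return with_buffer * 1.10            # safety margin

print(required_psu_watts([250] * 4, 150))   # ~1392 W -> buy a 1400 W PSU
```
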

One important part to be aware of is that even if a PSU has the required
wattage, it might not have enough PCIe 8-pin or 6-pin connectors. Make
sure you have enough connectors on the PSU to support all your GPUs!

Another important thing is to buy a PSU with a high power efficiency rating – especially if you run many GPUs and will run them for a long time.

Running a 4 GPU system at full power (1000-1500 watts) to train a convolutional net for two weeks will amount to 300-500 kWh, which in Germany – with rather high power costs of 20 cents per kWh – will amount to 60-100€ ($66-111). If this price assumes 100% efficiency, then training such a net with an 80% efficient power supply would increase the costs by an additional 18-26€ – ouch! This is much less for a single GPU, but the point still holds – spending a bit more money on an efficient power supply makes good sense.
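
A quick check of those numbers (my own arithmetic, assuming the German price of 0.20 €/kWh from above):

```python
# Rough check of the electricity numbers above (0.20 EUR/kWh assumed).
watts, hours, price = 1250, 24 * 14, 0.20      # 4 GPU box, two weeks of training
kwh = watts * hours / 1000                     # -> 420 kWh, inside the 300-500 range
print(f"{kwh:.0f} kWh, {kwh * price:.0f} EUR at 100% PSU efficiency")
print(f"{(kwh / 0.8 - kwh) * price:.0f} EUR extra with an 80% efficient PSU")
```
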

Using a couple of GPUs around the clock will significantly increase your
carbon footprint and it will overshadow transportation (mainly airplane)
and other factors that contribute to your footprint. If you want to be
responsible, please consider going carbon neutral like the NYU Machine
Learning for Language Group (ML2) — it is easy to do, cheap, and should
be standard for deep learning researchers.

CPU and GPU Cooling
Cooling is important, and it can be a significant bottleneck which reduces performance more than poor hardware choices do. You should be fine with a standard heat sink or an all-in-one (AIO) water cooling solution for your CPU, but for your GPU you will need to make special considerations.

Air Cooling GPUs
Air cooling is safe and solid for a single GPU or if you have multiple GPUs with space between them (2 GPUs in a 3-4 GPU case). However, one of the biggest mistakes is made when you try to cool 3-4 GPUs, and you need to think carefully about your options in this case.

Modern GPUs will increase their speed – and thus power consumption –
up to their maximum when they run an algorithm, but as soon as the
GPU hits a temperature barrier – often 80 °C – the GPU will decrease
the speed so that the temperature threshold is not breached. This
enables the best performance while keeping your GPU safe from
overheating.

However, typical pre-programmed schedules for fan speeds are badly designed for deep learning programs, so that this temperature threshold is reached within seconds after starting a deep learning program. The result is decreased performance (0-10%), which can be significant (10-25%) for multiple GPUs, where the GPUs heat up each other.

Since NVIDIA GPUs are first and foremost gaming GPUs, they are
optimized for Windows. You can change the fan schedule with a few
clicks in Windows, but not so in Linux, and as most deep learning
libraries are written for Linux this is a problem.

The only option under Linux is to set a configuration for your Xorg server (Ubuntu) where you set the option “coolbits”. This works very well for a single GPU, but if you have multiple GPUs where some of them are headless, i.e. they have no monitor attached to them, you have to emulate a monitor, which is hard and hacky. I tried it for a long time and had frustrating hours with a live boot CD to recover my graphics settings – I could never get it running properly on headless GPUs.
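
For reference, the relevant xorg.conf fragment looks roughly like this (a sketch, assuming the proprietary NVIDIA driver; a Coolbits value of 4 enables manual fan control, and the fan speed itself can then be set through nvidia-settings):

```
Section "Device"
    Identifier "Device0"
    Driver     "nvidia"
    Option     "Coolbits" "4"
EndSection
```
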

The most important point of consideration if you run 3-4 GPUs on air cooling is to pay attention to the fan design. The “blower” fan design pushes the air out of the back of the case so that fresh, cooler air is pushed into the GPU. Non-blower fans suck in air from the vicinity of the GPU and cool the GPU. However, if you have multiple GPUs next to each other then there is no cool air around, and GPUs with non-blower fans will heat up more and more until they throttle themselves down to reach cooler temperatures. Avoid non-blower fans in 3-4 GPU setups at all costs.

Water Cooling GPUs For Multiple GPUs
Another, more costly, and craftier option is to use water cooling. I do not recommend water cooling if you have a single GPU or if you have space between your two GPUs (2 GPUs in a 3-4 GPU board). However, water cooling makes sure that even the beefiest GPUs stay cool in a 4 GPU setup, which is not possible when you cool with air. Another advantage of water cooling is that it operates much more silently, which is a big plus if you run multiple GPUs in an area where other people work. Water cooling will cost you about $100 for each GPU and some additional upfront costs (something like $50). Water cooling will also require some additional effort to assemble your computer, but there are many detailed guides on that, and it should only require a few more hours of time in total. Maintenance should not be that complicated or effortful.

A Big Case for Cooling?
I bought large towers for my deep learning cluster because they have additional fans for the GPU area, but I found this to be largely irrelevant: about a 2-5 °C decrease, not worth the investment and the bulkiness of the cases. The most important part is really the cooling solution directly on your GPU — do not select an expensive case for its GPU cooling capability. Go cheap here. The case should fit your GPUs, but that’s it!

Conclusion Cooling
So in the end it is simple: for 1 GPU, air cooling is best. For multiple GPUs, you should either get blower-style air cooling and accept a small performance penalty (10-15%), or you pay extra for water cooling, which is more difficult to set up correctly but has no performance penalty. Air and water cooling are both reasonable choices in certain situations. I would, however, recommend air cooling for simplicity in general — get a blower-style GPU if you run multiple GPUs. If you want to use water cooling, try to find all-in-one (AIO) water cooling solutions for GPUs.

Motherboard
Your motherboard should have enough PCIe ports to support the number of GPUs you want to run (usually limited to four GPUs, even if you have more PCIe slots); remember that most GPUs have a width of two PCIe slots, so buy a motherboard that has enough space between PCIe slots if you intend to use multiple GPUs. Make sure your motherboard not only has the PCIe slots, but actually supports the GPU setup that you want to run. You can usually find this information if you search for your motherboard of choice on Newegg and look at the PCIe section of the specification page.

Computer Case
When you select a case, you should make sure that it supports full-length GPUs that sit on top of your motherboard. Most cases support full-length GPUs, but you should be suspicious if you buy a small case. Check its dimensions and specifications; you can also try a Google image search of that model and see if you find pictures with GPUs in them.

If you use custom water cooling, make sure your case has enough space for the radiators. This is especially true if you use water cooling for your GPUs. The radiator of each GPU will need some space — make sure your setup actually fits into the case.

Monitors
I first thought it would be silly to write about monitors too, but they make such a huge difference and are so important that I just have to write about them.

The money I spent on my three 27-inch monitors is probably the best money I have ever spent. Productivity goes up by a lot when using multiple monitors. I feel desperately crippled if I have to work with a single monitor. Do not short-change yourself on this matter. What good is a fast deep learning system if you are not able to operate it in an efficient manner?

[Image: Typical monitor layout when I do deep learning. Left: papers, Google searches, gmail, stackoverflow; middle: code; right: output windows, R, folders, system monitors, GPU monitors, to-do list, and other small applications.]
Some words on building a PC
Many people are scared to build computers. The hardware components are expensive and you do not want to do something wrong. But it is really simple, as components that do not belong together do not fit together. The motherboard manual is often very specific about how to assemble everything, and there are tons of guides and step-by-step videos which will guide you through the process if you have no experience.

The great thing about building a computer is that you know everything there is to know about building a computer once you have done it, because all computers are built in the very same way – so building a computer will become a life skill that you will be able to apply again and again. So there is no reason to hold back!

Conclusion / TL;DR
GPU: RTX 2070 or RTX 2080 Ti. GTX 1070, GTX 1080, GTX 1070 Ti, and
GTX 1080 Ti from eBay are good too!
CPU: 1-2 cores per GPU depending on how you preprocess data. > 2GHz; the CPU should support the number of GPUs that you want to run. PCIe lanes do not matter.
RAM:
– Clock rates do not matter — buy the cheapest RAM.
– Buy at least enough CPU RAM to match the RAM of your largest GPU.
– Buy more RAM only when needed.
– More RAM can be useful if you frequently work with large datasets.

Hard drive/SSD:
– Hard drive for data (>= 3TB)
– Use SSD for comfort and preprocessing small datasets.

PSU:
– Add up the watts of GPUs + CPU. Then multiply the total by 110% for the required wattage.
– Get a high efficiency rating if you use multiple GPUs.
– Make sure the PSU has enough PCIe connectors (6-pin and 8-pin).

Cooling:
– CPU: get standard CPU cooler or all-in-one (AIO) water cooling
solution
– GPU:
– Use air cooling
– Get GPUs with “blower-style” fans if you buy multiple GPUs
– Set coolbits flag in your Xorg config to control fan speeds

Motherboard:
– Get as many PCIe slots as you need for your (future) GPUs (one GPU
takes two slots; max 4 GPUs per system)

Monitors:
– An additional monitor might make you more productive than an
additional GPU.

Update 2018-12-14: Reworked entire blog post with up-to-date recommendations.
Update 2015-04-22: Removed recommendation for GTX 580
