A Full Hardware Guide to Deep Learning
2018-12-16 by Tim Dettmers 945 Comments
Deep Learning is very computationally intensive, so you will need a fast CPU with
many cores, right? Or is it maybe wasteful to buy a fast CPU? One of the worst things
you can do when building a deep learning system is to waste money on hardware that
is unnecessary. Here I will guide you step by step through the hardware you will need
for a cheap high-performance system.

Over the years, I have built a total of 7 different deep learning workstations, and despite
careful research and reasoning, I made my fair share of mistakes in selecting hardware
parts. In this guide, I want to share the experience that I gained over the years so that
you do not make the same mistakes that I did.

The blog post is ordered by mistake severity. This means the mistakes where people
usually waste the most money come first.

Contents
GPU
RAM
Needed RAM Clock Rate
RAM Size
CPU
CPU and PCI-Express
PCIe Lanes and Multi-GPU Parallelism
Needed CPU Cores
Needed CPU Clock Rate (Frequency)
Hard drive/SSD
Power supply unit (PSU)
CPU and GPU Cooling
Air Cooling GPUs
Water Cooling GPUs For Multiple GPUs
A Big Case for Cooling?
Conclusion Cooling
Motherboard
Computer Case
Monitors
Some words on building a PC
Conclusion / TL;DR
Related
Related Posts


GPU
This blog post assumes that you will use a GPU for deep learning. If you are building
or upgrading your system for deep learning, it is not sensible to leave out the GPU.
The GPU is the heart of deep learning applications – the improvement in
processing speed is just too huge to ignore.

I talked at length about GPU choice in my GPU recommendations blog post, and the
choice of your GPU is probably the most critical choice for your deep learning system.
There are three main mistakes that you can make when choosing a GPU: (1) bad
cost/performance, (2) not enough memory, (3) poor cooling.

For good cost/performance, I generally recommend an RTX 2070 or an RTX 2080 Ti. If
you use these cards you should use 16-bit models. Otherwise, GTX 1070, GTX 1080,
GTX 1070 Ti, and GTX 1080 Ti from eBay are fair choices and you can use these GPUs
with 32-bit (but not 16-bit).

Be careful about the memory requirements when you pick your GPU. RTX cards, which
can run in 16-bit, can train models which are twice as big with the same memory
compared to GTX cards. As such, RTX cards have a memory advantage, and picking
RTX cards and learning how to use 16-bit models effectively will carry you a long way. In
general, the memory requirements are roughly the following:

Research that is hunting state-of-the-art scores: >=11 GB


Research that is hunting for interesting architectures: >=8 GB
Any other research: 8 GB
Kaggle: 4 – 8 GB
Startups: 8 GB (but check the specific application area for model sizes)
Companies: 8 GB for prototyping, >=11 GB for training

Another problem to watch out for, especially if you buy multiple RTX cards, is cooling. If
you want to stick GPUs into PCIe slots which are next to each other, you should make
sure that you get GPUs with a blower-style fan. Otherwise you might run into
temperature issues, and your GPUs will be slower (by about 30%) and die faster.


Suspect line-up: Can you identify the hardware part which is at fault for bad performance? One of these GPUs? Or maybe it is the fault of the CPU after all?

RAM
The main mistake with RAM is to buy RAM with too high a clock rate. The second
mistake is to buy too little RAM to have a smooth prototyping experience.

Needed RAM Clock Rate


RAM clock rates are marketing stunts where RAM companies lure you into buying
"faster" RAM which actually yields little to no performance gains. This is best explained
by the "Does RAM speed REALLY matter?" video on RAM by Linus Tech Tips.

Furthermore, it is important to know that RAM speed is pretty much irrelevant for fast
CPU RAM -> GPU RAM transfers. This is so because (1) if you use pinned memory,
your mini-batches will be transferred to the GPU without involvement from the CPU,
and (2) if you do not use pinned memory, the performance gain of fast vs. slow
RAM is about 0-3% — spend your money elsewhere!
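To make the pinned-memory point concrete, here is a minimal PyTorch sketch (the dummy dataset is a placeholder I added, not part of the original post): pin_memory=True places each batch in page-locked memory so the copy to the GPU can be done by DMA, and non_blocking=True lets that copy overlap with GPU compute.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for a real dataset (1024 random "images" with labels).
train_dataset = TensorDataset(torch.randn(1024, 3, 224, 224),
                              torch.randint(0, 10, (1024,)))

# pin_memory=True puts each mini-batch into page-locked (pinned) CPU memory,
# so it can be copied to the GPU without the CPU shuffling bytes around.
loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                    num_workers=4, pin_memory=True)

for inputs, targets in loader:
    # non_blocking=True overlaps the host-to-device copy with GPU compute.
    inputs = inputs.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)
    # ... forward/backward pass here ...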

RAM Size
RAM size does not affect deep learning performance. However, it might hinder you
from executing your GPU code comfortably (without swapping to disk). You should
have enough RAM to comfortably work with your GPU. This means you should have at
least the amount of RAM that matches your biggest GPU. For example, if you have a
Titan RTX with 24 GB of memory you should have at least 24 GB of RAM. However, if
you have more GPUs you do not necessarily need more RAM.


The problem with this "match largest GPU memory in RAM" strategy is that you might
still fall short of RAM if you are processing large datasets. The best strategy here is to
match your GPU and, if you feel that you do not have enough RAM, just buy some
more.

A different strategy is influenced by psychology: Psychology tells us that concentration
is a resource that is depleted over time. RAM is one of the few hardware pieces that
allows you to conserve your concentration resource for more difficult programming
problems. Rather than spending lots of time circumnavigating RAM bottlenecks,
you can invest your concentration in more pressing matters if you have more RAM.
Especially in Kaggle competitions, I found additional RAM very useful for feature
engineering. So if you have the money and do a lot of pre-processing, then additional
RAM might be a good choice. With this strategy, you want to have more, cheap RAM
now rather than later.

CPU
The main mistake that people make is to pay too much attention to the PCIe lanes of a
CPU. You should not care much about PCIe lanes. Instead, just look up whether
your CPU and motherboard combination supports the number of GPUs that you want
to run. The second most common mistake is to get a CPU which is too powerful.

CPU and PCI-Express


People go crazy about PCIe lanes! However, the thing is that they have almost no effect on
deep learning performance. If you have a single GPU, PCIe lanes are only needed to
transfer data from your CPU RAM to your GPU RAM quickly. For example, an ImageNet
batch of 32 images (32x225x225x3) in 32-bit needs 1.1 milliseconds with 16 lanes, 2.3
milliseconds with 8 lanes, and 4.5 milliseconds with 4 lanes. These are theoretical
numbers, and in practice you often see PCIe be twice as slow — but this is still
lightning fast! PCIe lanes often have a latency in the nanosecond range and thus
latency can be ignored.
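These numbers are easy to re-derive; here is a quick back-of-the-envelope sketch (the ~985 MB/s per PCIe 3.0 lane figure is my assumption, not a number from the post):

batch_bytes = 32 * 225 * 225 * 3 * 4      # 32 images, 32-bit floats: ~19.4 MB
for lanes in (16, 8, 4):
    bandwidth = lanes * 985e6             # assumed PCIe 3.0 bytes per second per lane
    print(f"{lanes:2d} lanes: {batch_bytes / bandwidth * 1e3:.1f} ms")
# prints roughly 1.2 ms, 2.5 ms and 4.9 ms, close to the theoretical numbers quoted above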

Putting this together we have for an ImageNet mini-batch of 32 images and a ResNet-
152 the following timing:

Forward and backward pass: 216 milliseconds (ms)


16 PCIe lanes CPU->GPU transfer: About 2 ms (1.1 ms theoretical)
8 PCIe lanes CPU->GPU transfer: About 5 ms (2.3 ms)
4 PCIe lanes CPU->GPU transfer: About 9 ms (4.5 ms)

Thus going from 4 to 16 PCIe lanes will give you a performance increase of roughly
3.2%. However, if you use PyTorch’s data loader with pinned memory you gain exactly
0% performance. So do not waste your money on PCIe lanes if you are using a single
GPU!


When you select CPU PCIe lanes and motherboard PCIe lanes make sure that you
select a combination which supports the desired number of GPUs. If you buy a
motherboard that supports 2 GPUs, and you want to have 2 GPUs eventually, make
sure that you buy a CPU that supports 2 GPUs, but do not necessarily look at PCIe
lanes.

PCIe Lanes and Multi-GPU Parallelism


Are PCIe lanes important if you train networks on multiple GPUs with data parallelism?
I have published a paper on this at ICLR2016, and I can tell you if you have 96 GPUs
then PCIe lanes are really important. However, if you have 4 or fewer GPUs this does
not matter much. If you parallelize across 2-3 GPUs, I would not care at all about PCIe
lanes. With 4 GPUs, I would make sure that I get 8 PCIe lanes per
GPU (32 PCIe lanes in total). Since almost nobody runs a system with more than 4
GPUs, here is a rule of thumb: do not spend extra money to get more PCIe lanes per GPU
— it does not matter!

Needed CPU Cores


To be able to make a wise choice for the CPU we first need to understand the CPU and
how it relates to deep learning. What does the CPU do for deep learning? The CPU
does little computation when you run your deep nets on a GPU. Mostly it (1) initiates
GPU function calls, (2) executes CPU functions.

By far the most useful application for your CPU is data preprocessing. There are two
different common data processing strategies which have different CPU needs.

The first strategy is preprocessing while you train:

Loop:

1. Load mini-batch
2. Preprocess mini-batch
3. Train on mini-batch

The second strategy is preprocessing before any training:

1. Preprocess data
2. Loop:
1. Load preprocessed mini-batch
2. Train on mini-batch

For the first strategy, a good CPU with many cores can boost performance significantly.
For the second strategy, you do not need a very good CPU. For the first strategy, I
recommend a minimum of 4 threads per GPU — that is usually two cores per GPU. I
have not done hard tests for this, but you should gain about 0-5% additional
performance per additional core/GPU.


For the second strategy, I recommend a minimum of 2 threads per GPU — that is
usually one core per GPU. You will not see significant gains in performance when you
have more cores if you are using the second strategy.
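As an illustration of the first strategy, here is a hedged PyTorch/torchvision sketch (the dataset path and the transforms are placeholders I chose, not the post's): the num_workers processes decode and augment images on the CPU while the GPU trains on the previous mini-batch, which is where the "about 4 threads per GPU" rule of thumb comes from.

import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

# On-the-fly preprocessing (first strategy): decode and augment in CPU worker processes.
transform = T.Compose([T.RandomResizedCrop(224),
                       T.RandomHorizontalFlip(),
                       T.ToTensor()])
dataset = ImageFolder("/path/to/imagenet/train", transform=transform)  # placeholder path

# Rule of thumb from above: ~4 workers per GPU for on-the-fly preprocessing;
# with already-preprocessed data (second strategy), 1-2 workers per GPU are usually enough.
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    num_workers=4, pin_memory=True)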

Needed CPU Clock Rate (Frequency)


When people think about fast CPUs they usually first think about the clock rate. 4 GHz
is better than 3.5 GHz, or is it? This is generally true for comparing processors with the
same architecture, e.g. "Ivy Bridge", but it does not compare well between processors of
different architectures. Also, it is not always the best measure of performance.

In the case of deep learning there is very little computation to be done by the CPU:
increment a few variables here, evaluate some Boolean expressions there, make some
function calls on the GPU or within the program – all these depend on the CPU core
clock rate.

While this reasoning seems sensible, the CPU is at 100% usage when I run deep
learning programs, so what is the issue here? I did some CPU core clock rate
underclocking experiments to find out.

CPU underclocking on MNIST and ImageNet: Performance is measured as time taken on 200 epochs MNIST or a quarter epoch on ImageNet with different CPU core clock rates, where the maximum clock rate is taken as a baseline for each CPU. For comparison: upgrading from a GTX 680 to a GTX Titan is about +15% performance; from GTX Titan to GTX 980 another +20% performance; GPU overclocking yields about +5% performance for any GPU.

Note that these experiments were run on dated hardware; however, the results
should still be the same for modern CPUs/GPUs.

Hard drive/SSD
The hard drive is not usually a bottleneck for deep learning. However, if you do stupid
things it will hurt you: if you read your data from disk only when it is needed (blocking
wait), then a 100 MB/s hard drive will cost you about 185 milliseconds for an ImageNet
mini-batch of size 32 — ouch! However, if you asynchronously fetch the data before it
is used (for example, with torchvision loaders), then you will have loaded the mini-batch
within those 185 milliseconds while the compute time for most deep neural networks on
ImageNet is about 200 milliseconds. Thus you will not face any performance penalty,
since you load the next mini-batch while the current one is still computing.
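The arithmetic behind those 185 milliseconds, as a rough sketch (the per-image size of about 0.58 MB for compressed ImageNet JPEGs is my assumption, not a number from the post):

batch_mb = 32 * 0.58           # ~18.5 MB of compressed images per mini-batch (assumed)
hdd_mb_per_s = 100.0           # sequential read speed of a typical hard drive
print(f"blocking read: {batch_mb / hdd_mb_per_s * 1e3:.0f} ms")   # ~186 ms
# With asynchronous prefetching this overlaps with the ~200 ms forward/backward
# pass, so the hard drive stops being a bottleneck.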

However, I recommend an SSD for comfort and productivity: Programs start and
respond more quickly, and pre-processing with large files is quite a bit faster. If you buy
an NVMe SSD you will have an even smoother experience when compared to a
regular SSD.

Thus the ideal setup is to have a large and slow hard drive for datasets and an SSD for
productivity and comfort.

Power supply unit (PSU)


Generally, you want a PSU that is sufficient to accommodate all your future GPUs.
GPUs typically get more energy efficient over time; so while other components will
need to be replaced, a PSU should last a long while so a good PSU is a good
investment.

You can calculate the required watts by adding up the wattage of your CPU and GPUs,
plus an additional 10% for other components and as a buffer for power spikes. For
example, if you have 4 GPUs with 250 watts TDP each and a CPU with 150 watts TDP,
then you will need a PSU with a minimum of 4×250 + 150 + 100 = 1250 watts. I would
usually add another 10% just to be sure everything works out, which in this case would
result in a total of 1375 watts. I would round up in this case and get a 1400 watt PSU.
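The same sizing rule as a small Python helper (a sketch: it applies a strict 10% twice instead of the flat 100 watts used in the example above, so it lands slightly above 1375 watts, but the conclusion, a roughly 1400 watt PSU, is the same):

def required_psu_watts(gpu_tdps, cpu_tdp, margin=0.10):
    """PSU rule of thumb: component TDPs plus ~10% for other parts and power
    spikes, then another ~10% safety margin; round up to the next PSU size."""
    base = sum(gpu_tdps) + cpu_tdp
    base += base * margin          # other components and power spikes
    return base * (1 + margin)     # extra safety margin

print(required_psu_watts([250, 250, 250, 250], 150))   # ~1392 W -> buy a 1400 W PSU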

One important part to be aware of is that even if a PSU has the required wattage, it
might not have enough PCIe 8-pin or 6-pin connectors. Make sure you have enough
connectors on the PSU to support all your GPUs!

Another important thing is to buy a PSU with a high power efficiency rating – especially
if you run many GPUs and will run them for a long time.

Running a 4 GPU system on full power (1000-1500 watts) to train a convolutional net
for two weeks will amount to 300-500 kWh, which in Germany – with rather high
power costs of 20 cents per kWh – will amount to 60-100€ ($66-111). If this price is for a
100% efficiency, then training such a net with an 80% power supply would increase the
costs by an additional 18-26€ – ouch! This is much less for a single GPU, but the point
still holds – spending a bit more money on an efficient power supply makes good
sense.
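For reference, here is the calculation spelled out for the middle of that range (a sketch with assumed numbers: a 1250 W system, two weeks of around-the-clock training, 0.20 EUR/kWh):

watts = 1250                   # assumed full-load draw of a 4-GPU system
hours = 14 * 24                # two weeks of around-the-clock training
kwh = watts * hours / 1000     # = 420 kWh, inside the 300-500 kWh range above
cost_100 = kwh * 0.20          # ~84 EUR at 100% PSU efficiency
cost_80 = kwh / 0.80 * 0.20    # ~105 EUR at 80% efficiency, i.e. ~21 EUR extra
print(kwh, cost_100, cost_80)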

Using a couple of GPUs around the clock will significantly increase your carbon
footprint and it will overshadow transportation (mainly airplane) and other factors that
contribute to your footprint. If you want to be responsible, please consider
going carbon neutral like the NYU Machine Learning for Language Group (ML2) — it is
easy to do, cheap, and should be standard for deep learning researchers.

CPU and GPU Cooling


Cooling is important and it can be a significant bottleneck which reduces performance
more than poor hardware choices do. You should be fine with a standard heat sink or
all-in-one (AIO) water cooling solution for your CPU, but for your GPUs you will
need to make special considerations.

Air Cooling GPUs


Air cooling is safe and solid for a single GPU or if you have multiple GPUs with space
between them (2 GPUs in a 3-4 GPU case). However, one of the biggest mistakes is
made when you try to cool 3-4 GPUs, and you need to think carefully about your
options in this case.

Modern GPUs will increase their speed – and thus power consumption – up to their
maximum when they run an algorithm, but as soon as the GPU hits a temperature
barrier – often 80 °C – the GPU will decrease the speed so that the temperature
threshold is not breached. This enables the best performance while keeping your GPU
safe from overheating.

However, typical pre-programmed schedules for fan speeds are badly designed for
deep learning programs, so that this temperature threshold is reached within seconds
after starting a deep learning program. The result is decreased performance (0-10%),
which can be significant for multiple GPUs (10-25%) where the GPUs heat up each
other.

Since NVIDIA GPUs are first and foremost gaming GPUs, they are optimized for
Windows. You can change the fan schedule with a few clicks in Windows, but not so in
Linux, and as most deep learning libraries are written for Linux this is a problem.

The only option under Linux is to set a configuration for your Xorg server
(Ubuntu) where you set the option "coolbits". This works very well for a single GPU, but
if you have multiple GPUs where some of them are headless, i.e. they have no monitor
attached to them, you have to emulate a monitor, which is hard and hacky. I tried it for
a long time and had frustrating hours with a live boot CD to recover my graphics
settings – I could never get it running properly on headless GPUs.
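For reference, the usual coolbits route looks roughly like the following sketch (flag values and the need for a fake screen on headless GPUs vary by driver version and distribution, so treat it as a starting point rather than a recipe):

# Write an xorg.conf with manual fan control enabled; the
# --allow-empty-initial-configuration flag attaches a (fake) screen to headless GPUs.
sudo nvidia-xconfig --enable-all-gpus --cool-bits=4 --allow-empty-initial-configuration

# After restarting X, set the fan speed manually, e.g. to 80% on the first GPU:
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" -a "[fan:0]/GPUTargetFanSpeed=80"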

The most important point of consideration if you run 3-4 GPUs on air cooling is to pay
attention to the fan design. The "blower" fan design pushes the air out to the back of
the case so that fresh, cooler air is pushed into the GPU. Non-blower fans suck in air in
the vicinity of the GPU and cool the GPU. However, if you have multiple GPUs next to
each other then there is no cool air around, and GPUs with non-blower fans will heat
up more and more until they throttle themselves down to reach cooler temperatures.
Avoid non-blower fans in 3-4 GPU setups at all costs.

Water Cooling GPUs For Multiple GPUs
Another, more costly, and craftier option is to use water cooling. I do not recommend
water cooling if you have a single GPU or if you have space between your two GPUs (2
GPUs in a 3-4 GPU board). However, water cooling makes sure that even the beefiest
GPUs stay cool in a 4 GPU setup, which is not possible when you cool with air. Another
advantage of water cooling is that it operates much more silently, which is a big plus if
you run multiple GPUs in an area where other people work. Water cooling will cost you
about $100 for each GPU and some additional upfront costs (something like $50).
Water cooling will also require some additional effort to assemble your computer, but
there are many detailed guides on that and it should only require a few more hours of
time in total. Maintenance should not be that complicated or effortful.

A Big Case for Cooling?


I bought large towers for my deep learning cluster, because they have additional fans
for the GPU area, but I found this to be largely irrelevant: About 2-5 °C decrease, not
worth the investment and the bulkiness of the cases. The most important part is really
the cooling solution directly on your GPU — do not select an expensive case for its
GPU cooling capability. Go cheap here. The case should fit your GPUs, but that's it!

Conclusion Cooling
So in the end it is simple: for 1 GPU air cooling is best. For multiple GPUs, you should
either get blower-style air cooling and accept a small performance penalty (10-15%), or you
pay extra for water cooling, which is more difficult to set up correctly but carries
no performance penalty. Air and water cooling are both reasonable choices in certain
situations. I would however recommend air cooling for simplicity in general — get a
blower-style GPU if you run multiple GPUs. If you want to use water cooling, try to find
all-in-one (AIO) water cooling solutions for GPUs.

Motherboard
Your motherboard should have enough PCIe ports to support the number of GPUs
you want to run (usually limited to four GPUs, even if you have more PCIe slots);
remember that most GPUs have a width of two PCIe slots, so buy a motherboard that
has enough space between PCIe slots if you intend to use multiple GPUs. Make sure
your motherboard not only has the PCIe slots, but actually supports the GPU setup
that you want to run. You can usually find this information if you search for your
motherboard of choice on Newegg and look at the PCIe section of the specifications page.


Computer Case
When you select a case, you should make sure that it supports full length GPUs that sit
on top of your motherboard. Most cases support full length GPUs, but you should be
suspicious if you buy a small case. Check its dimensions and specifications; you can
also try a google image search of that model and see if you find pictures with GPUs in
them.

If you use custom water cooling, make sure your case has enough space for the
radiators. This is especially true if you use water cooling for your GPUs. The radiator of
each GPU will need some space — make sure your setup actually fits into the case.

Monitors
I first thought it would be silly to write about monitors also, but they make such a huge
difference and are so important that I just have to write about them.

The money I spent on my three 27-inch monitors is probably the best money I have ever
spent. Productivity goes up by a lot when using multiple monitors. I feel desperately
crippled if I have to work with a single monitor. Do not short-change yourself on this
matter. What good is a fast deep learning system if you are not able to operate it in an
efficient manner?

Typical monitor layout when I do deep learning: Left: Papers, Google searches, gmail, stackoverflow; middle: Code; right: Output windows, R, folders, systems monitors, GPU monitors, to-do list, and other small applications.

Some words on building a PC


Many people are scared to build computers. The hardware components are expensive
and you do not want to do something wrong. But it is really simple, as components
that do not belong together do not fit together. The motherboard manual is often very
specific about how to assemble everything, and there are tons of guides and
step-by-step videos which guide you through the process if you have no experience.

The great thing about building a computer is that you know everything there is to
know about building a computer once you have done it, because all computers are built
in the very same way – so building a computer will become a life skill that you will be
able to apply again and again. So there is no reason to hold back!

Conclusion / TL;DR
GPU: RTX 2070 or RTX 2080 Ti. GTX 1070, GTX 1080, GTX 1070 Ti, and GTX 1080 Ti
from eBay are good too!
CPU: 1-2 cores per GPU, depending on how you preprocess data. > 2 GHz; the CPU should
support the number of GPUs that you want to run. PCIe lanes do not matter.

RAM:
– Clock rates do not matter — buy the cheapest RAM.
– Buy at least as much CPU RAM as the RAM of your largest GPU.
– Buy more RAM only when needed.
– More RAM can be useful if you frequently work with large datasets.

Hard drive/SSD:
– Hard drive for data (>= 3TB)
– Use SSD for comfort and preprocessing small datasets.

PSU:
– Add up watts of GPUs + CPU. Then multiply the total by 110% for required Wattage.
– Get a high efficiency rating if you use multiple GPUs.
– Make sure the PSU has enough PCIe connectors (6- and 8-pin).

Cooling:
– CPU: get standard CPU cooler or all-in-one (AIO) water cooling solution
– GPU:
– Use air cooling
– Get GPUs with “blower-style” fans if you buy multiple GPUs
– Set coolbits flag in your Xorg config to control fan speeds

Motherboard:
– Get as many PCIe slots as you need for your (future) GPUs (one GPU takes two slots;
max 4 GPUs per system)

Monitors:
– An additional monitor might make you more productive than an additional GPU.


Update 2018-12-14: Reworked entire blog post with up-to-date recommendations.


Update 2015-04-22: Removed recommendation for GTX 580

Related

How To Build and Use a Multi GPU System for Deep Learning (2014-09-21) — In "Hardware"
Deep Learning Hardware Limbo (2017-12-21) — In "Hardware"
Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning (2023-01-30) — In "Deep Learning"

Related Posts

Which GPU(s) to Get for Deep Learning: My Experience and…
How to Choose Your Grad School
LLM.int8() and Emergent Features

Filed Under: Hardware


Tagged With: AMD, CPU, GPU, Intel, PCIe Lanes

Comments

Jay says
2021-07-29 at 20:31

Hey,

Thanks for that summary. You said that one should buy a GPU with at least 8GB
RAM but that RTX GPU RAM was twice as effective as GTX RAM. That brings me to
my question.

I have a choice between 2 laptops. Identical except one has an GeForce RTX 3060
6GB and costs $1400; while the other has a GeForce RTX 3070 8GB and costs
$2000.


I know the RTX 3060 will be slower but is 6GB acceptable? You implied it will be the
equivalent of a GeForce GTX 12GB RAM video card for RAM utilization.

Please advise as I’d really like to save the extra $600 in cost between the 2 laptops.

Given that video card add-ins for desktops for 3000 series RTX cards seem to start
at $1000 it seems to me I should bide my time with a good entry level laptop with
an RTX GPU that has much fairer prices until the video card price gouging is done
for.

Thanks!
Reply

Tim Dettmers says


2021-10-24 at 11:26

6GB is indeed a bit small – I would go for the 8 GB GPU

Reply

zoey79 says
2021-06-09 at 09:16

Wonderful article. However, I am about to buy a new laptop. So what do you feel
about the idea of a gaming laptop for deep learning?

Reply

Tim Dettmers says


2021-10-24 at 10:58

Gaming laptops are excellent for deep learning. Make sure to get a beefy GPU!

Reply

TK says

2021-10-24 at 18:22
I had a gaming laptop for deep learning. However, I think a desktop is still a
better choice. Using a laptop for deep learning tends to overheat the laptop, and the
battery appears to degrade much faster.
Moreover, the largest GPU memory in a laptop is 8 GB, but note that not all 8 GB
can be allocated for deep learning, which may not be sufficient if you are training
a very deep network or a dual network. A mobile GPU is also less efficient than a
desktop GPU, and compute speed (CPU etc.) can also be slower than on a gaming
desktop.

Reply

Chaitanya says
2021-02-01 at 23:15

Thank you Tim for the post, it was very helpful to understand the importance of
hardware components in deep learning.

I have been researching the hardware requirements to begin a deep learning
project on my workstation for a couple of months, and finally read your article, which has
answered a lot of my questions. I did realize the GPU on my machine will not be
sufficient, so I wanted to get your thoughts on its replacement or adding a second
one.

Please suggest if I can add any Nvidia 20xx series GPU to below configuration.

– Dual CPU – Xeon E5 2670 – V2 10 cores each, 64GB RAM


– Existing GPU – Nvidia Geforce 1050
– power unit – 800 watts
– two PCi e gen 3 X 16 slots (with 4 other gen slots in between, currently one is in
use for 1050)

Reply

Kriskr3 says
2021-02-01 at 15:12

Hello Tim,
I had read your great article on GPU recommendations for Deep learning, it was
informative and would help anyone who is interested and serious about this field. I
found the article when I searched Google for ideas on a GPU upgrade; after
reading your responses to the posts I wanted to ask my question right here. I have an HP
workstation that has Nvidia Geforce GTX 1050 (4GB) so looking to either replace it
or add another. Power unit is 800 watt, dual CPU, two PCIe GEN 3 X16, one PCI e
GEN3 X8 and three other GEN2. I believe at max I can add one GPU (may be low
wattage) due to space and power limitation. I’m not sure if I can even add Nvidia
Geforce 20X series to the existing or I need to replace. I would appreciate if you can
share view based on your experience.
Reply

Imahn says
2021-01-26 at 13:07

Dear Tim,
I would have five short questions, I am really sorry!

(i) I am generally wondering: I am not 100% sure yet whether I should opt for 2
GPUs or 4GPUs. I would of course first buy 1 GPU and then scale, but if I know a
priori that I plan to have only two GPUs, I could opt for a cheaper MB, CPU, cooler,
PSU, etc. Does one maybe need 2 GPUs to do some testing on hyperparameters of
papers that one reads, and 4 GPUs if one wants to build own neural networks (and
thus test even more ideas)? Do you have any brief thoughts on this, or a link we
could read?

(ii) What do you think of this PSU from EVGA:


https://www.newegg.com/evga-supernova-750-g-120-gp-0750-x1-2000w/p/1HU-
00J7-006T1?
Description=2000%20watt%20power%20supply&cm_re=2000_watt%20power%20s
upply-_-9SIAHT8BHT5519-_-Product

I am asking because in the other post of yours, you were writing about the problem
of
4 x RTX 3090, but wouldn’t this PSU solve the problem? But you didn’t mention this
PSU, that’s why I am confused. (Apparently, this PSU only works under 220 V, so for
me, I couldn’t buy it, but wouldn’t it be great for US Americans?)

(iii) For a possible 4-GPU setup, do you think that an Intel Core i7-9800X with 8
cores is enough for the 4 GPUs at full utilization + the normal things that one does
(reading papers, having Zoom meetings, using LibreOffice, VirtualBox, etc.)? This
CPU would cost me 480 $, but more cores would even cost more. I generally
suspect that I will need a GPU rather than CPU for ML, I know you recommend 1-2
CPUs per GPU, but with 4 GPUs, that would be 4-8 just for the GPUs, so I am
honestly unsure.

(iv) This question is strongly related to (iii): Is it possible to use the Deep Learning
PC for normal home-office while the 4 GPUs are at full utilization?


(v) If I opted for 4 GPUs in blower-style fan, wouldn’t my neighbors be able to hear
it? They have small babies and I am honestly worried that the noise at night would
be too much… Any thoughts would be appreciated.
Reply

Dmytro says
2021-01-21 at 02:44

Hi Tim!
My CPU is an Intel® Pentium(R) CPU G4560 @ 3.50GHz × 4,
and I get an error when I try to load a model with TF 2.2 and above:
Process finished with exit code 132 (interrupted by signal 4: SIGILL)
With TF 1.5 it works fine.
I have read a lot and found that it is connected to the CPU, is that true?
I really need to understand what the problem is.
Thanks for your attention!
Have a nice day.

Reply

marco says
2021-04-01 at 07:23

1. The versions of Python and TF should match (please verify the specs in the TF
software requirements).

2. The precompiled CPU binary of TF can include CPU-specific optimizations
that may not be compatible. Generally this happens when a binary built for a
different CPU architecture is launched on an unsupported CPU. Probably TF 1.5
could run on your CPU, but the new version was not compiled for it or was not
compatible with your Python version.

How to solve it:

If you can confirm it is the Python incompatibility issue, that can be solved quickly
by installing the appropriate version of Python (I suppose 3.8, because I have the
same TF 2.2 on my machine and it uses 3.8).
If the problem is at the binary/CPU compatibility level you should:
A: change the CPU
or
B: compile TF from source ON YOUR CPU.

I have compiled TF on my CPU several times and you can do it, don't worry.
Reply

Armand says
2021-01-11 at 11:43

Hi Tim,
I’m building a DL rig for a student organization and I’m wondering how to share it
with students. I want to be able to create VMs and erase/reconfigure them if
students mess up. I want to use it kind of like a personal AWS Cloud.
Do you have any leads I should follow or keywords I should search for ?

Thanks!

Reply

Mira says
2021-01-11 at 02:37

Hi Tim, all,
We are about to buy (when available) RTX 3090 for AI, PyTorch and TensorFlow.

The computer where I planned to put the GPU has an i7-3930K, which runs only at PCIe
2.0. How much would PCIe 2.0 limit the performance of the computations?

I know the theoretical throughputs, but I have no idea about real performance.
Could you please give me some example of the performance degradation?

Thanks, Mira

Reply

Mira says
2021-01-18 at 03:30

Edit: I enabled PCIe 3.0 via force-enable-gen3.exe.

Now, the throughput and performance seem to be fine.


Reply

Audi says
2021-01-07 at 09:29

Hi Tim,

Thanks for this article and late happy new year!


I am currently doing simple ML and some deep learning for images as a hobby
with my laptop. I wanted to build a pc with a budget constraint of around 2k USD
as I wanted to learn more about deep learning and AI as a beginner.

Here is my current PCPartsPicker list too:


PCPartPicker Part List: https://pcpartpicker.com/list/jGM33Z

CPU: AMD Ryzen 9 3900XT 3.8 GHz 12-Core Processor


CPU Cooler: Cooler Master Hyper 212 Black Edition 42 CFM CPU Cooler
Motherboard: Gigabyte X570 AORUS ELITE WIFI ATX AM4 Motherboard
Memory: Corsair Vengeance LPX 32 GB (2 x 16 GB) DDR4-2666 CL16 Memory (x2)
Storage SSD: ADATA XPG SX8200 Pro 1 TB M.2-2280 NVME Solid State Drive
Storage HDD: Seagate Barracuda Compute 4 TB 3.5″ 5400RPM Internal Hard Drive
Video Card: MSI GeForce RTX 3070 8 GB GAMING X TRIO Video Card
Case: Phanteks Eclipse P300A Mesh ATX Mid Tower Case
Power Supply: Cooler Master MasterWatt 750 W 80+ Bronze Certified Semi-
modular ATX

The question:
1. For the CPU I am conflicted with Ryzen 9 3900XT (12 cores) where people
claimed that each core performance is better than Ryzen Threadripper 2950X (16
cores). For deep learning and ML, which one is better?

2. For the GPU i am also conflicted with going for either MSI RTX 3070 (8gb
memory config, 256 memory bus, and 5888 core ) or Zotac RTX 3080. Your post
recommended to go for RTX 3080 (10gb memory config, 320 memory bus, and
8704 cores) ; however, with my budget, I can only land either of these two where
the RTX 3080 Zotac is claimed to be subpar brand. or maybe should I wait for the
upcoming RTX 3070 Ti (10gb memory config, 320 memory bus and ~7424 cores)
or RTX 3060 non-Ti version(12gb memory config, 192 memory bus, and ~3840
cores) ?

3. As a windows OS user for a long time, should I make it dual OS windows and
Linux or just install Linux for this pc? (I am not getting used to Linux performance
and would like to use my pc for program like office and some steam games)


Thanks!
Reply

David says
2021-03-06 at 09:17

Hey Audi,

I’m currently struggling with a similar problem. Either the Ryzen 9 3900x or the
5800x. Do you know which one is better for deep learning? Following the
explanation given by tim I suppose that the 12 cores outperforms the 8 cores of
the 5800x?

Reply

Frederick Carlson says


2021-01-04 at 07:28

Scary. This is very similar to my build

Great article and site – BTW

CPU: Intel Core i9-10900X Cascade Lake 3.7GHz Ten-Core


CPU Cooler: Noctua NH-D15S CPU Cooler
Motherboard: Gigabyte X299X Aorus Master Intel LGA 2066 eATX Motherboard
Memory:
G.Skill Ripjaws V Series 32 GB (2 x 16 GB) DDR4-3200 CL16 Memory
G.Skill Trident Z RGB 32GB (2 x 16GB) DDR4-3200 PC4-25600 CL16 Memory
Storage: Two Samsung M2 SSD (2X500 GB) and 4 2TB WD Black HDDs
Video Card: NVIDIA GeForce 1080ti (x1)
Video Card: NVIDIA GeForce 1070ti (x3)
Case: Lian Li O11D XL-W ATX Full Tower Case
Power Supply: Thermaltake 1250 W

Reply

Imahn says
2021-01-02 at 08:35

Hi Tim,

(i) I hope you are doing fine! I am currently searching a bit for an appropriate
motherboard to support 2-3 GPU’s, and I honestly don’t know how to read the
specifications to decide …

So the way I understood some of your comments, it is not necessary to have PCIe
4.0 lanes, but PCIe 3.0 lanes seem to do their job for Machine Learning. Now let’s
say that I want to have three GPU’s, what does the specification need to say? Since I
cannot find RTX 3080 in blower-style fans, I suspect that I need enough space
between the GPU’s as to not run into cooling problems.

As an example, let’s consider the MSI Z490-A PRO ATX (https://www.cyberport.de/?


DEEP=2303-9AR&APID=21&msclkid=19d8555294d71af589d60ab781345cc7), I’m
sorry this website is in German …

It says that this Motherboard has the following specs:


2x PCIe 3.0 x16 (1x x16, 1x x4), 3x PCIe 3.0 x1

Is this good for 3 GPU’s? To me, on the image, the 3 x PCIe 3.0 x1 look really small,
so I guess only the 3.0 x 16 could be used for GPU’s.

(ii) Does it make sense to buy a 3-year warranty for a Motherboard?

(iii) The Motherboard would come with SATA-cables, would I need more cables to
connect the Motherboard to the PSU or the GPU’s?

Thanks!

Reply

Tim Dettmers says


2021-01-19 at 15:53

Hi Imahn,
what you want to look out for is a motherboard with specifications that say
x16/x16/x16, or the same with x8 instead. This indicates that the
motherboard supports three GPUs and that each GPU gets 16 lanes. This is different
from how many PCIe slots you have. The easiest way to check a motherboard
is to go to Newegg.com, as it has the most information on hardware and the
information is standardized. It seems your motherboard only supports one
GPU.

Reply

https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/ 20/366
2024/4/2 12:40 A Full Hardware Guide to Deep Learning — Tim Dettmers

Lauren says
2020-12-13 at 14:52

Thanks so much for your post!! I’m trying to build my first personal setup for deep
learning. I’m hoping to start with one RTX 3090, but then have space for up to
three if I wanted to expand in the future. Do you have any advice on this setup:

CPU: Intel Core i9-10850K Comet Lake 10-Core 3.6 GHz LGA 1200 125W
GPU: RTX 3090
MB: ASUS WS X299 SAGE LGA 2066 Intel X299
RAM: 128GB: Corsair LPX 8*16GB 3200
PS: CORSAIR AX1600i 1600W
Case: Fractal Design Define 7 XL
SSD: HP EX920 M.2 1TB PCIe NVMe NAND SSD
HD: Western Digital 4TB

My plan was to start with 1 GPU, and allow room to expand. Looks like this setup
should support up to three with room around each GPU for cooling? Do you think
I’d need an additional cooler? Any other advice or suggestions on this setup?

Thanks so much, I really appreciate any thoughts that you have!


Lauren

Reply

Tim Dettmers says


2021-01-02 at 01:32

Hi Lauren, the build looks fine to me. Make sure that you can fit all potential
three RTX 3090 in the case that you chose.

Reply

Brandon Wolfson says


2020-12-01 at 10:15

Hi Tim,

Thanks for this article, this is super helpful for a first time builder like me. I had a few
questions:

1) Do you think intel i9-10900F will be enough cores (it has 10 cores) for a dual RTX
3090 build? I know you recommend min 2 cores/GPU, but I asked Lambda Labs
and they recommended min 12 cores for dual RTX 3090 and so I got worried.

2) Also, I realize dual RTX 3090 build is probably impractical with this mobo. In that
case, do you think a RTX 3090 and RTX 3080 Ti (hopefully it comes out) would work
well with this setup?

3) Lastly, do you have any thoughts on vertical monitors or ultra-wide monitors?

Here is my PCPartsPicker list too:


PCPartPicker Part List: https://pcpartpicker.com/list/zvCZ2V

CPU: Intel Core i9-10900F 2.8 GHz 10-Core Processor


CPU Cooler: *Noctua NH-D15 82.5 CFM CPU Cooler
Motherboard: ASRock Z490 Taichi ATX LGA1200 Motherboard
Memory: G.Skill Ripjaws V Series 32 GB (2 x 16 GB) DDR4-3200 CL16 Memory
Storage: ADATA XPG SX8200 Pro 2 TB M.2-2280 NVME Solid State Drive
Video Card: NVIDIA GeForce RTX 3090 24 GB Founders Edition Video Card
Video Card: NVIDIA GeForce RTX 3090 24 GB Founders Edition Video Card (or
maybe RTX 3080 Ti for this 2nd GPU)
Case: Lian Li O11D XL-W ATX Full Tower Case
Power Supply: Corsair HX Platinum 1200 W 80+ Platinum Certified Fully Modular
ATX Power Supply

Thanks!
Reply

Brandon Wolfson says


2020-12-01 at 10:17

BTW, I plan to get more system memory too once I get a 2nd GPU. I was
thinking 32 GB to start with 1 GPU, then buying more once I get my 2nd GPU.
Thanks!

Reply

Tim Dettmers says


2021-01-02 at 01:12

Hi Brandon,


1) 10 cores is plenty, you will be fine!


2) I would try to avoid mixing GPU types to allow for better parallelization. If
parallelization is not that important for you, this could be a good option.
3) I personally dislike vertical monitors. It is easier on the neck to move left-to-
right rather than up-down. I think even large documents can be read quite well
on an ultra-wide monitor.

The build looks solid to me!


Reply

John B says
2021-01-19 at 19:23

Hello Brandon,

I am looking to build my first setup for DL and would like to know how your
choice of parts is working out for you so far.

Reply

TOBIN says
2020-11-28 at 06:30

Hello Tim,

Thanks for the blog. I read the blog fully and I am building a deep learning
machine and I would like to have you expertise in building a perfect machine for
my purpose.

I am a start-up; I am building this machine for my start-up, which does people
tracking using a multi-camera setup and deep learning classification based on their
actions.

I intend to run the multi-camera people tracking and deep learning classification
on the GPU and use the output.

Once I get this project to the working phase, I will use this machine as a backend
server for tracking and classification of a small group of around 20 to 30 people.

I am building this machine on a budget of around 3500 Dollars. This is my build:


GPU: 1x ZOTAC RTX 3090 (~ 2000 $)

RAM: 1x CORSAIR VENGEANCE LPX 32GB (16GBX2) DDR4 DRAM 3000MHZ C16
MEMORY KIT (~ 140$)

CPU: AMD RYZEN 5 5600X PROCESSOR (UPTO 4.6 GHZ) (~ 440$)

SSD: 1x Samsung 970 PRO NVME M.2 512GB SSD (~ 365$)

Hard Disk: 1x WD BLUE 2TB INTERNAL HDD (~ 65$)

PSU: ANTEC HCP1000 80 PLUS PLATINUM SERIES 1000 WATTS (~ 160$)

Case: DEEPCOOL GAMERSTORM MACUBE (~ 110$)

MotherBoard: GIGABYTE B550 AORUS ELITE AX WIFI (~ 215$)

Any suggestions on this build?

What type of cooling is required for this machine?
Should I go for cheaper products anywhere in this line-up?
Is the RTX 3090 enough, or should I use a Titan RTX or anything else?
How about the CPU, is it enough for this workload?
Anything else you want to comment on?
Reply

Tim Dettmers says


2021-01-02 at 01:04

Hello Tobin,

this looks good. One thing though: if you want to use the server as a backend
for 20 to 30 models, it might be more manageable to have multiple
smaller GPUs to spread the load. I am not sure how much memory you will
need and what your budget is, but either multiple RTX 2070, RTX 2080 Ti, RTX
3070, or RTX 3080 Ti might be a better choice than a single RTX 3090.

Reply

John B says
2021-01-19 at 19:24

Hello Tobin,


I am watching this space to get an idea of how to build my first setup for DL
and would like to know how your choice of parts is working out for you so far.
Reply

Dominik P says
2020-11-25 at 16:44

Hi Tim,

Thanks for the great article! I’m a first-time PC builder, trying to build an ML
Workstation on a 2000-3000€ budget. This is my current plan:

CPU: Ryzen 9 3900X


GPU: RTX 3090
MB: Asus ROG Strix X750-E Gaming Wifi
RAM: Corsair LPX 2*16GB 3200
PS: Corsair 750W
Case: NZXT H710
SSD: Samsung 970 Evo 1 TB M.2

Could you give me some advice on this setup?


Should I get a cheaper motherboard?
Air or liquid cooling? (an amd wraith prism air cooler is included with the cpu)
Is it a problem that the cpu has no integrated memory? I will probably lose 1-2GB
vram to Xorg, right?
Any other recommendations?

Thank you so much!,


Dominik

Reply

Tim Dettmers says


2020-11-27 at 20:31

The question would be: why would you get the PCIe 4.0 motherboard in the first
place? For gaming, you might get some advantages, but not really for anything
deep learning related. So if you want to build a pure deep learning machine, I
would maybe buy a cheaper motherboard. On the other hand, if you later get
NVMe SSDs which support full PCIe 4.0 speeds and a second GPU, it might be
worth it if you run some things which are very storage-intensive, such as deep
learning with very large datasets. Otherwise, the build looks good!


Reply

Emmanuel says
2020-11-24 at 06:10

Hi Tim,

Thanks a lot for your useful and complete article.

I'm a French PhD student in the first months of my thesis. I have a question about
your paragraph on memory requirements for research in deep learning.
You say that for research that is hunting state-of-the-art scores we need GPU
memory >= 11 GB.
I assume that with memory below 11 GB the computation still works fine, except that
the computation takes longer. Is that true?

Bravo for your analysis

Kind regards

Emmanuel

Reply

Tim Dettmers says


2020-11-27 at 20:33

Some deep learning models are so large that you cannot run them with an 11
GB memory (you might be able to do so with some complicated tricks). These
models are usually some big transformer models. If you run only computer
vision, you can come quite far with 8-10GB but your networks might be a bit
slow because you need to run them with a very small batch size.

Reply

Maciek says
2020-11-12 at 00:44


Hi,
Thanks for the post.
Did you try using Windows with WSL2 for DL? This could solve some of your
problems (and create new ones).

Reply

Tim Dettmers says


2020-11-12 at 10:29

As I understand it, GPUs and WSL2 do not have an easy time working together.
However, PyTorch has pretty good native Windows support as far as I understand,
so one does not need to use WSL2.

Reply

Ryan Adonde says


2020-10-15 at 01:02

Hi Tim, thanks for the great blog.

I was wondering if you had any thoughts on an 8 GPU setup with dual root
architecture (4 GPUs attached to each CPU). My main focus is distributed training
across all 8 GPUs, but have concerns that the CPU/CPU interconnect may become
a bottleneck for communication between GPUs as some other sources have
suggested that dual root is a big no-no when trying to scale across 8 GPUs (for this
exact reason). However, the cards I am looking to get do not support P2P (2080 ti)
and will have to send data via the CPU anyway (for the cards attached to the same
CPU) so was wondering if you had experience on how problematic that extra hop
across CPUs will be for the cards that are not connected to the same CPU.

Many thanks

Reply

Tim Dettmers says


2020-10-15 at 13:17

Usually, it is not that big of an issue and parallelization is still quite fast. If you
use the right software (integrated into most libraries) then the GPU memory will
be transferred to a pinned CPU buffer which can be directly transferred to
the other CPU/GPU. Overall, the communication cost should only be about
twice as expensive as normal. That sounds like a high cost, but
communication is only a small part of the training costs. As such, your models
will still be quite fast when parallelized. You can expect something like 6.75x
speedup compared to 7.25x for a system with P2P and a single root.
Reply

Ryan Adonde says


2020-10-15 at 17:05

Thanks for the info! Much appreciated

Reply

Bob O says
2020-10-07 at 11:40

Hey Tim,

I finally bit the bullet and built a machine (with the exception of the scarce 3070 that
will go in later). You mentioned a software post, but I could not find it. Do you have
favorites that you would suggest for doing my software setup? I was going to start
with Ubuntu because it does not appear I can access the GPU with the VMs I have
from Windows.

I am looking for specific things like openCV in python, caffe, keras, enabling and
using my GPU… basically exactly what you have done for hardware, but the step
following assembly!

Thanks, and please keep up the good work.

Reply

Tim Dettmers says


2020-10-07 at 12:50


Hey Bob,
unfortunately, I do not have a software guide. I would recommend Ubuntu
since using GPUs through a VM can be a pain (or not work at all, depending on
the motherboard). In terms of software, I would look into Anaconda3 on
Ubuntu which is a package manager for scientific computing. You can
download it freely and can install all the software that you mention without the
need for compiling anything. Compiling OpenCV for example can be a pain
whereas in anaconda you just execute “conda install -c anaconda opencv” and
you are done.
Good luck!

Reply

Frank Fletcher says


2020-10-07 at 03:03

Hi Tim, thank you for sharing all your work with these hardware guides!

I don’t know if you have an update for this article somewhere else but there are
now several ways to control the fan curves for NVidia GPUs. The easiest way I’ve
found is to use GreenWithEnvy.

https://gitlab.com/leinardi/gwe

Reply

Tim Dettmers says


2020-10-07 at 12:56

Thank you, Frank, I have not seen it before! This looks excellent, thank you for
sharing! Another package I know about is coolgpus which is designed for
servers where some of the NVIDIA options are not available because no
monitors are connected to the GPUs. So coolgpus is pretty good for servers,
but the gwe package looks a bit better than coolgpus for the desktop case.

Reply


Matt says
2020-10-03 at 11:25

Hey Tim!

Thanks for this post – really helpful! For my first PC build, I'm planning on using a
Ryzen 7 3700x CPU and RTX 2080 Super (will replace in the future with the new
GPUs). You talked about GPU cooling but what is your opinion on CPU cooling? Is
the stock cooler for my CPU not good enough and should I consider AIO solutions?

Thanks!

Reply

Tim Dettmers says


2020-10-07 at 13:10

Hey Matt! Often, a stock cooler is okay for the CPU although it can be a bit
loud. Many people are now installing AIO solutions on their CPU for better and
more silent cooling. However, it was shown that a good air cooler is often just
as good and even more silent than AIO water cooling solutions. The bottom
line for deep learning though is mostly noise: If a bit of noise is okay, then go
with stock, if you want a more silent setup go with either AIO or a good air
cooler. In either case, if you train large models that saturate your GPU, your
GPU will also be quite loud, so in that case, a silent CPU cooler will not make
the greatest difference. I personally prefer as silent working environments as
possible, and I always buy a dedicated CPU cooler.

Reply

Bob Nestor says


2020-09-25 at 13:26

Hi Tim:

Awesome content! I’m a retired software engineer looking to learn more about AI &
ML.
I have a few questions about H/W:
– Intel or AMD (I’m leaning towards AMD using an X570 MB)
– Best starter OS

and S/W:
– Best courseware
– Best learning samples.

Cheers…Bob
Thanks for sharing your knowledge.
Reply

Tim Dettmers says


2020-09-29 at 10:25

Hi Bob!
– AMD CPUs are great; so an X570 MB is great.
– Use Ubuntu 20.04 + Anaconda + PyTorch. If you want to do deep learning
that is the way to go. You will have the least issues overall if you use that.
– fast.ai is by far the best course for deep learning for software engineers
– just google around for pytorch samples for the models that you learn about
in the fast.ai classes

Good luck!

Reply

Bob Nestor says


2020-10-02 at 14:42

Hi Tim:
Glad I found your site. I truly appreciate your help in advancing my
knowledge of AI & DL. Really appreciate your help.

Cheers…Bob

Reply

haykelvin says
2020-09-21 at 02:28


Hi Tim, thank you so much for this awesome article. It is very informative and
interesting to see those numbers (both theoretical and from actual testing) in the
reasoning. I am new to the field and started playing with PyTorch recently; I am sure
this will help me and lots of others make a wise choice when selecting hardware in
future ML builds.

I have read through the thread and saw your concern about AMD GPU software
compatibility issues. I saw a few good deals on a VEGA FE in my local second-hand
market, and the 16 GB of RAM looks sweet on paper. Do you think those can give me
some good bang for the buck if I don't mind experimenting with them a bit? I also
see cost-efficient upgradability if I want to get more of those on the future second-hand
market. Or would you recommend just sticking with CUDA?

Reply

Tim Dettmers says


2020-09-29 at 10:44

Our community could definitely need more AMD enthusiasts. Currently, AMD
GPUs work for deep learning, but their performance is not as good and there
might be some hidden issues here and there. So if you want to just get things
running I recommend NVIDIA + CUDA. If you want to contribute actively to the
community, AMD and ROCm are great — this option helps a lot to diffuse the
NVIDIA monopoly over time, but you can expect a more frustrating experience.

Reply

joy says
2020-09-20 at 07:10

Hi, I need one recommendation: I have 8 GPUs to build a deep learning machine.
Which motherboard (which supports AMD 7000 series CPUs) would you recommend
that supports 8 PCIe x16 slots … and has multiple M.2 SSD slots too…

Reply

Tim Dettmers says


2020-09-29 at 10:42


There is no regular server motherboard from desktop vendors that does that, I
think. You need to go with specialized motherboards like those from
Supermicro. I have too little experience with servers to recommend a particular
motherboard. Usually, you just go with what you need and what is cheap, making
sure support and warranty cover all issues, so you will get it working and keep it
working without any problem.

Reply

Christophe Bessette says


2020-09-17 at 14:22

Hi Tim,

I wanted to start by saying that I loved reading your GPU and deep learning
hardware guides, I learned a lot!

They still left me with a couple of questions (I'm pretty new when it comes to computer
building and specs in general). I'm mainly interested in Deep Reinforcement
Learning, and I read that for DRL the CPU is much more important than it is in other
fields of deep learning because of the need to handle the simulations. So I'm
wondering if going with a Ryzen 5 2600 is enough or if I should go with something
which has more cores, a higher clock and/or more supported memory. Also, with DRL,
can I get away with a cheaper GPU like the RTX 2060 or the GTX 1070? I'm not really on a
tight budget, but I'm looking to make it as cost-effective as possible while not
being restrained too much by my PC.

I don’t know if it matters but i’m mostly trying to do Reinforcement Learning for
financial markets trading.

Thank you!

Reply

Tim Dettmers says


2020-10-13 at 18:09

Hi Christophe,

I think for deep reinforcement learning you want a CPU with lots of cores. The
Ryzen 5 2600 is a pretty solid counterpart for an RTX 2060. GTX 1070 could
also work, but I would prefer an RTX 2060 for DRL. You could also wait a bit for

the RTX 3060 and get a cheap threadripper to improve performance further.
However, that setup might be a bit more expensive and you have to wait
longer to get the parts.
Reply

Chris-Sij says
2020-09-15 at 14:35

I’m looking to build a home-based machine learning setup that will utilize transfer
learning and classification and apply findings comparatively to CT’s. I’ll have a
plethora of data but my actual data input size is incredibly small in single instances.
I’m looking to build a system that provides the most bang for my buck and have a
desire to build a machine around a Titan XP, if possible (or advised). There’s a
potential for getting a second Titan for future work if the single one is not enough
or up to the task later on. However, I’m unfamiliar with Nvidia based setups when it
comes to personal building so I’d love some advice on what kind of other parts I
should be looking to pick-up. I’m most likely going to be pairing this single Titan
with 32GB of RAM (2-16GB sticks), but am pretty much stuck after that point. I’d
appreciate any direction you could provide as this is all new territory to me and am
trying to avoid cloud-computing services like AWS for the time being.

Thank you in advance!

Reply

Tim Dettmers says


2020-10-13 at 18:25

Have a look at my other blog post about GPUs. There I have some "barebone" setups for 2 GPUs which you can use as a guide for your build.

Reply

John Heffernan says


2020-09-14 at 09:51


You really are something else! You have provided some exemplary resources for
me. Seeing the number of responses to comments you have is incredible. Thank
you very much for what you do!

Reply

Tim Dettmers says


2020-09-14 at 15:35

Thank you

Reply

darklinux says
2020-09-09 at 00:15

Hello, I am faced with a legitimate dilemma: I plan to create a cluster of computers for machine learning as part of my startup, but I have a very limited budget. I hesitate between two options:
1: two RTX 2070s with a server running a Ryzen 5
2: four GTX 1660 Supers / Ryzen 5

all with https://cnvrg.io/


Help !

Reply

Tim Dettmers says


2020-09-09 at 08:04

I would definitely go with two RTX 2070 Super. The memory on GTX 1660 is just
a bit small.

Reply

darklinux says
2020-09-09 at 18:41

Thanks for your reply; I was thinking of the 3070, not the 2070.

Reply

Tim Dettmers says


2020-09-10 at 14:02

If it would be two RTX 3070s for the price of 4x GTX 1660 then
definitely go with the 2x RTX 3070!

Reply

darklinux says
2020-09-10 at 16:32

thank you , good weekend

Abhishek says
2020-08-24 at 22:54

Hi Tim,
Your article is nice and informative. You have really great experience with server configurations. I had a small doubt!
What kind of server configuration would be required to do video analytics on 30-40 4MP CCTV cameras simultaneously? It's basically a boundary surveillance project. In the video analytics, the task would be to identify humans, animals, or birds. I am inclined to use Intel processors in general.
What if the number of cameras is 12? What configuration would you suggest?

Thank you

Reply


Tim Dettmers says


2020-09-14 at 21:32

At 4MP and a framerate of 30 fps you have about 36 MB per second; multiplied by 40 cameras that means about 1.4 GB/s. The main problem here is to store that data and pass it quickly to the GPUs. An NVMe SSD RAID would be very helpful here. Otherwise, it depends on the network and the resolution that you want to process. 4MP is pretty large and you definitely need to downsize the images. Downsized images with YOLO can be processed at about 200 FPS, which means you need about 6 GPUs to process the data efficiently. These figures are for RTX 20 GPUs, so I imagine 4x RTX 30 GPUs could work. If you reduce the frame rate to about 8 fps per CCTV camera you could process everything on a single GPU.
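To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch. The per-frame size (about 1.2 MB for a compressed 4MP frame) and the ~200 FPS YOLO throughput are assumptions chosen to match the figures in the reply, not measured values.

```python
# Rough sketch of the data-rate and GPU-count estimate above (all assumptions).
mb_per_frame = 1.2        # assumed size of a compressed 4MP frame in MB
fps_per_camera = 30
num_cameras = 40
yolo_fps_per_gpu = 200    # assumed YOLO throughput on downsized images (RTX 20 class)

per_camera_mb_s = mb_per_frame * fps_per_camera          # ~36 MB/s per camera
total_gb_s = per_camera_mb_s * num_cameras / 1000        # ~1.4 GB/s in total
total_fps = fps_per_camera * num_cameras                 # 1200 frames/s
gpus_needed = -(-total_fps // yolo_fps_per_gpu)          # ceiling division -> 6 GPUs

print(f"per camera: {per_camera_mb_s:.0f} MB/s, total: {total_gb_s:.2f} GB/s")
print(f"total frames/s: {total_fps}, GPUs at {yolo_fps_per_gpu} FPS each: {gpus_needed}")
# Dropping to ~8 fps per camera gives ~320 frames/s, which fits on one or two
# GPUs of this class (or a single faster GPU).
```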

Reply

Xuan says
2020-08-12 at 23:00

Hi Tim,
Thank you for the detailed description of building a deep learning machine. I would appreciate your suggestion on whether the config I am building will work out well or not.

CPU – AMD Ryzen 3900x (12core)


CPU Cooler – corsair H100i RGB
Motherboard – Gigabyte X570 aorus ultra
Memory – Corsair Vengeance LPX 32 GB (2*16) DDR4-3200
Storage – Samsung 970 Evo plus 500GB M.2-2280 NVME SSD and HDD 1TB
Graphic Card – ZOTAC RTX2080 ti
Power Supply : EVGA SuperNOVA G2 1300W
Case fan: Cooler Master Blade master 40.79 CFM 80mm Fan

I was wondering if I can add one more graphic card (2080 ti) to the above config.
Does the above motherboard support 2 graphic cards 2080ti?

Reply

Tim Dettmers says


2020-09-14 at 21:41


The build looks good! The motherboard supports multiple GPUs, so that would be an option. If you only get a single GPU, you do not need a 1300W power supply; 700-800W would probably be sufficient if you go for a single RTX 30 GPU. With 2 GPUs, it makes sense to go for 1300W just to have a bit of extra room (also for a third GPU).
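For readers sizing their own PSU, here is a rough estimating sketch. The TDP values are approximate placeholders for whatever GPU and CPU you actually pick; real power draw and the headroom you want will vary.

```python
# Rough PSU sizing sketch (approximate TDPs, illustrative only).
gpu_tdp_w = 320         # e.g. an RTX 3080-class card; an RTX 2080 Ti is ~250 W
cpu_tdp_w = 105         # e.g. a Ryzen 3900X
rest_of_system_w = 100  # motherboard, RAM, drives, fans (rough guess)
headroom = 1.25         # margin for power spikes and PSU efficiency

for num_gpus in (1, 2, 3):
    watts = (num_gpus * gpu_tdp_w + cpu_tdp_w + rest_of_system_w) * headroom
    print(f"{num_gpus} GPU(s): aim for roughly a {watts:.0f} W PSU")
# Roughly 650 W for one GPU, 1050 W for two, 1450 W for three; consistent with
# 700-800 W for a single-GPU build and 1300 W leaving room for a second or third GPU.
```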

Reply

Alex says
2020-08-12 at 05:16

Hello Tim.
I'm looking for a good laptop to start with deep learning.
Can you please advise me whether an HP Omen 15″ model with an Intel i7 processor, 16 GB RAM, and an NVIDIA RTX 2070 GPU (8 GB) is a good choice?
If it is not, what is a good laptop in your opinion?

Thank you in advance.

Alex

Reply

Tim Dettmers says


2020-09-14 at 22:04

I do not know much about laptops. There are many other things to consider
because laptops can be quite personal (battery life, weight etc.). In terms of
deep learning performance, i7, 16 GB RAM and RTX 2070 sounds very good for
a laptop. With that, you would definitely be able to do some pretty good deep
learning.

Reply

Keshav says
2020-08-10 at 04:40


Hi, I was planning to build a PC with the RTX 2060 Super. Should I wait for the 30xx series in terms of price and performance, or shall I go ahead and order it? Since the RTX 20xx series is getting discontinued, I need to make a decision soon.

Reply

Andrew Hughes says


2020-07-26 at 04:54

Hi Tim,

I’m a little confused by this:

“Be careful about the memory requirements when you pick your GPU. RTX cards,
which can run in 16-bits, can train models which are twice as big with the same
memory compared to GTX cards. As such RTX cards have a memory advantage
and picking RTX cards and learn how to use 16-bit models effectively will carry you
a long way.”

Does this mean that because of the lower precision the memory requirement is halved, hence you can have a model which is twice as big for a card with the same RAM?

I've also read in other places about "models not fitting into memory"; what does this actually mean? What are we "fitting into RAM"? Is it a combination of the model itself and the data? Or just the model? Or just the data? I thought that with things like TF we load things in batches anyway, so why does this matter? Could you clear this confusion up for me?

Thanks

Reply

Tim Dettmers says


2020-09-14 at 22:08

The data usually takes up almost no memory since we, as you rightly pointed out, only load one batch into GPU memory. Otherwise, it depends on the model that you are working with. Convolutional networks are very small models with very large activations, while transformers are somewhere in between (weights, gradients, and activations are all large). Activations here refer to the data representations which are passed through the network. These need to be stored to compute the gradient in the backward pass.
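To make this concrete, here is a minimal PyTorch sketch that tallies parameter memory and (very roughly) activation memory for a toy conv net. The layer sizes and batch shape are arbitrary placeholders; the point is only that conv-net weights are tiny compared to the activations, and that halving the bytes per element (FP32 to FP16) roughly halves all the numbers.

```python
import torch
import torch.nn as nn

# Toy conv net: tiny weights, large activations (rough estimate, not exact accounting).
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 10),
)

bytes_per_el = 2  # 2 bytes for FP16, 4 for FP32

n_params = sum(p.numel() for p in model.parameters())
weight_mb = n_params * bytes_per_el / 1e6
grad_mb = weight_mb  # one gradient value per weight

activation_elems = 0
def count_output(module, inputs, output):
    global activation_elems
    activation_elems += output.numel()

hooks = [m.register_forward_hook(count_output) for m in model]
with torch.no_grad():
    model(torch.randn(16, 3, 128, 128))  # one batch of 16 "images"
for h in hooks:
    h.remove()

act_mb = activation_elems * bytes_per_el / 1e6
print(f"weights ~{weight_mb:.2f} MB, gradients ~{grad_mb:.2f} MB, "
      f"activations ~{act_mb:.0f} MB for this batch")
```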
Reply

Mira says
2020-07-01 at 07:28

Hello, I would like to build a computer with an AMD 3rd-gen CPU.
My requirements are focused on PCIe lanes. It must have a GPU at x16 together with 2 LAN cards (x4 + x1) and two M.2 SSDs (at least one at full x4 speed).

My question is, would this work on a mainstream AM4 motherboard with PCIe lanes working at x16/x4/x1 (+ x4 for the SSD)?
I don't want to downgrade it to x8/x8/x1.

Reply

Tim Dettmers says


2020-07-03 at 07:49

I am not quite sure if that works. As I understand it, there are a lot of different combinations of Ryzen 3rd-gen CPUs and PCIe 4.0 motherboards, but I think in any case you are only able to use 16x lanes for the GPU if you just use the M.2 SSDs. If you add 2 LAN cards I think this will downgrade the GPU to 8x lanes. I could be wrong about this, but as I understand it, in many cases you still have "extra lanes", but these are not distributed equally across all slots and using some components will draw lanes away from the GPU.

Reply

Mira says
2020-07-07 at 23:38

If you check the manual:

https://dlcdnets.asus.com/pub/ASUS/mb/SocketAM4/PRIME_X570-P/E15650_PRIME_X570-P_UM_WEB_V2.pdf

And there are others like it; you can see that for dual GPU there are x16 + x4 lanes.

The CPU provides x16 lanes, and the PCH provides x4 + 3×x1.

Therefore in this case it should work. Is that correct?


Reply

Odyssee says
2020-06-09 at 07:33

Hi there, thanks for the study!

Do you have any benchmark recommendations to test all those facts? A benchmark that could highlight the impact of an overclocked CPU, PCIe lanes, …
Common benchmarks focus only on GPU comparisons.

Thanks,

Reply

Tim Dettmers says


2020-07-03 at 07:28

Sorry, I think I no longer have the code; I never uploaded it to GitHub. I used a Linux tool that downclocks the core clock rate and benchmarks performance that way. For PCIe lanes you can just use NVIDIA's CUDA sample library for benchmarking.

Reply

TK says
2020-05-24 at 20:51

What about a laptop that is equipped with an RTX 2070 Super Max-P? Would it be sufficient for deep learning? I understand that the mobile GPU is less efficient than a desktop GPU, but I'm working from two sites, so having a laptop is definitely a win for me.


Reply

Tim Dettmers says


2020-07-03 at 07:54

You can do many things with that GPU, but not all models will fit into 8 GB of memory and it will be roughly half as fast as a desktop GPU. If that is okay for you it might be a good option.

Reply

Danilo Cominotti Marques says


2020-11-20 at 12:59

Tim,

I understand that the 8GB VRAM situation shouldn't be a problem in terms of "not being able to fit some models at all" if Linux and CUDA are used, with 'per_process_gpu_memory_fraction' >= 2 and 'allow_growth' = True set in TensorFlow (because then it uses CUDA Unified Memory). Of course there could be significant performance impacts, but then you'd be able to fit bigger models. Any thoughts on this?
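For reference, a minimal sketch of the TF1-style configuration the comment describes (assuming the tf.compat.v1 session API); setting the memory fraction above 1.0 is what asks TensorFlow to oversubscribe GPU memory via CUDA Unified Memory. As the reply below notes, the resulting paging is generally too slow for training.

```python
import tensorflow as tf

# Sketch of the setting described above (TF1-style sessions via tf.compat.v1).
gpu_options = tf.compat.v1.GPUOptions(
    per_process_gpu_memory_fraction=2.0,  # >1.0 requests unified-memory oversubscription
    allow_growth=True,                    # allocate incrementally instead of all at once
)
config = tf.compat.v1.ConfigProto(gpu_options=gpu_options)
sess = tf.compat.v1.Session(config=config)
# Paging between GPU and host memory makes this far too slow for training
# large models in practice (see the reply below).
```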

Reply

Tim Dettmers says


2020-11-22 at 22:27

Unfortunately, this feature will be too slow to train neural networks. There are some similar techniques that you could use (memory swapping), but these need to be programmed or tuned for each neural network separately and do not work automatically. So with such techniques, 8 GB of GPU memory can be enough even for the largest neural networks, but unified memory does not allow for this right now.

Reply


lazy_propogator says
2020-05-24 at 03:44

Hello Tim, hope you are doing well. Can you help me choose between the AMD Ryzen 7 3750H and an Intel i5 processor for the CPU? They are being coupled with the RTX 2060 and the GTX 1660 Ti respectively. Would the AMD CPU be a bottleneck for the RTX? Are there any potential problems which can arise from the use of AMD CPUs in deep learning?

Reply

Mayur says
2020-05-21 at 09:14

Hi Tim,
Thank you for the detailed description of all the essentials in setting up a deep learning machine. I am currently building my DL machine and would appreciate your suggestion on whether the config I am building will work out well or not.

CPU – Intel Core i7-9700K 3.6 GHz 8-Core


CPU Cooler – Fractal Design Celsius s24 87.6 CFM Liquid CPU cooler
Motherboard – Asus ROG STRIX Z390-E Gaming ATX LGA 1151
Memory – Corsair Vengeance LPX 64 GB (4*16) DDR4-3200
Storage – Samsung 970 Evo 500GB M.2-2280 NVME SSD
Graphic Card – ZOTAC RTX2080 SUPER AMP 8GB GDDR6 ZT-T20820D-10P
Case: NZXT H500 ATX Mid Tower Case
Power Supply : Corsair RMx(2018) 850 W 80+ Gold Certified Fully Modular ATX
Case fan: Cooler Master Blade master 40.79 CFM 80mm Fan
I will add 1 TB HDD with above config.

Reply

Tim Dettmers says


2020-07-03 at 07:52

The CPU is a bit overkill if you want to just do deep learning. If you want to also
do other things with the computer it looks pretty good. This would be a well-
balanced build for Kaggle competitions for example.


Reply

Russell says
2020-05-19 at 20:33

Hi Tim,

Thanks for your article. It helped me to select parts for my rig.


I would appreciate some feedback on my selection.

https://au.pcpartpicker.com/list/JFhR8M

As the mobo has 1x x16 slot and 1x x8 slot and hence the 2nd GPU is only getting
PCIe 3.0 x8, should I get a different mobo that supports both x16?

Cheers,
Russ

Reply

Tim Dettmers says


2020-07-03 at 07:51

Looks solid to me!

Reply

Greg says
2020-05-02 at 12:13

Hi Tim,

Thank you for all the information that you put in here.

However, I have a problem with choosing a GPU for my motherboard, which is an ASRock Z270 Pro4. I am trying to upgrade my PC for software like ZBrush, Maya, Substance Painter, etc.

I am considering buying one of these:

GEFORCE RTX 2060 OC REV2 6144MB GDDR6 PCI-EXPRESS GRAPHICS CARD

GEFORCE RTX 2060 VENTUS XS OC 6144MB GDDR6 PCI-EXPRESS GRAPHICS CARD

The problem I have is that on one website I looked at, these GPUs perform average with my motherboard.

I don't want to waste my money, so I would like to ask if you know any website where I could compare the performance of a GPU with a motherboard, or if you have any suggestions about which GPUs would be best for my motherboard.

Thank you, and looking forward to your reply


Reply

Tim Dettmers says


2020-07-03 at 07:19

I think they should perform equally well on the motherboard. I am not sure why
it would be otherwise.

Reply

Marc says
2020-04-30 at 13:28

32GB vs 64GB of RAM: given current RAM prices, is it worth just going for 64GB?

Doing work mostly in computer vision.

CPU: Ryzen 7 3800x


GPU: 2 x RTX 2070 Super

Reply


Tim Dettmers says


2020-07-03 at 07:19

Yes, I agree. RAM prices are fluctuating but right now RAM is pretty affordable!

Reply

Wen says
2020-04-20 at 21:49

Hi Tim,

Thanks for your post. I have an i7-3770 4-core desktop with 16GB RAM. I was happy to see in the Q&A above that it's OK to buy an RTX 2080 Ti. But then I read on another website that the power supply won't be enough for GTX cards above a 1030 on an OptiPlex 7010 motherboard, which is what I have. I haven't confirmed that the PSU is 250W (for which I suppose I need to open the cover and check physically). Do you know if that is true?
Thank you.

Reply

Tim Dettmers says


2020-04-26 at 21:40

That is true: if the PSU is only 250 watts you will not be able to run an RTX 2080 Ti on it. If you start upgrading the PSU, though, it might be worth considering whether to just upgrade it or to build a new desktop entirely. Both options can make sense depending on budget and other constraints.

Reply

Dimiter says
2020-04-11 at 04:47


Hi Tim,
I have an old X99 motherboard and Intel CPU (Intel i7-5930K) from 2015. One (or possibly both) of them died; I cannot really troubleshoot without replacing them. I am thinking of buying both, but the question is whether to move to AMD or stay with Intel: AMD's 1920X or 2920X with an X399 motherboard vs an Intel CPU (not sure which one, the comparable Intel 7900X is crazy expensive) and an X299 motherboard. It seems much more economical to move to AMD, even if it would complicate processor water cooling etc. What do you think?
Thank you.

Reply

sourav says
2020-03-19 at 09:58

Hi Tim,
I currently have a gaming PC with a 1050 Ti, 8GB DDR3 RAM, and an AMD FX-6300 processor. I would like to upgrade to an RTX 2060 Super / RTX 2060, 16GB DDR4 RAM, and an Intel CPU. I want to save money for the GPU, so I decided to use a budget CPU. I am thinking about the Intel Core i3-9100F (4 cores, 3.6 GHz, 65 W, locked, $80). This does not come with integrated graphics (I will add the GPU). Is that a good CPU for a single-GPU build? Or should I look for an old CPU & motherboard combo under $120 on eBay? If yes, which CPU and motherboard would be a good fit for my budget?

Thank You

Reply

Tim Dettmers says


2020-04-03 at 19:51

I think the CPU should be more than fine for a single GPU. You should worry more about other applications (CPU-based ML for Kaggle competitions, for example) that might be bottlenecked by the CPU. You could also go with an AMD CPU, which is now pretty cost-efficient and powerful, but it would only make a small difference.

Reply


Jochen van Osch says


2020-03-05 at 09:08

Hello Tim,

thank you for the insightful article. Maybe you can also give me some guidance on the choice of GPU. I work in a hospital and want to start deep learning projects on high-resolution image datasets from MRI and CT.

Our budget for buying a PC is ~5000 euro.

The GPU choice I am now thinking about is: an RTX Titan versus 1 (or 2) RTX 2080 Ti (in combination with an AMD Threadripper CPU on an X399 board).

I think we will not run multiple projects at once, but I want to be on the safe side GPU-memory-wise, and the Titan has 24GB.

Kind regards,
Jochen van Osch

Reply

Tim Dettmers says


2020-04-03 at 19:33

I would definitely go with the RTX Titan! The memory will be a life-saver if you work with medical images! Also make sure to invest in NVMe SSDs, as loading large unprocessed images can be a big bottleneck. I recommend getting a motherboard that supports 3x NVMe SSDs, then getting 3 of them and setting up a virtual SSD RAID 0.

Reply

Adam TS says
2020-03-03 at 13:15

I was wondering what the lower limit on RAM speeds is? I am looking at repurposing old server hardware and have 64GB of 1333MHz DDR3 memory, and was wondering if this would be a bottleneck? Also, I have committed to offsetting my carbon footprint, and wanted to thank you for encouraging others to do the same!!
Reply

Tim Dettmers says


2020-04-03 at 19:26

It can always be a bit tricky to re-purpose old hardware, but if the computer boots with the RAM sticks then they should not be the biggest bottleneck. Since you rarely use the RAM in deep learning training, and since the RAM is usually of similar speed to the PCIe bus, it should not be a big bottleneck. If you run DDR3 memory with 4 GPUs, the PCIe bus and the RAM should be of about equal speed and you should only lose about 5-10% performance.
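The ballpark numbers behind this claim, using theoretical peak bandwidths (real-world throughput is lower):

```python
# Theoretical peak bandwidths (rough comparison).
ddr3_1333_per_channel = 1333e6 * 8              # MT/s * 8 bytes, about 10.7 GB/s
ddr3_dual_channel = 2 * ddr3_1333_per_channel   # about 21.3 GB/s

pcie3_per_lane = 0.985e9                        # about 985 MB/s per PCIe 3.0 lane
pcie3_x16 = 16 * pcie3_per_lane                 # about 15.8 GB/s
pcie3_x8 = 8 * pcie3_per_lane                   # about 7.9 GB/s (typical per GPU with 4 GPUs)

print(f"DDR3-1333 dual channel: {ddr3_dual_channel / 1e9:.1f} GB/s")
print(f"PCIe 3.0 x16: {pcie3_x16 / 1e9:.1f} GB/s, x8: {pcie3_x8 / 1e9:.1f} GB/s")
# The old RAM is in the same ballpark as (or faster than) the PCIe transfers it
# feeds, so it is usually not the main bottleneck for GPU training.
```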

Reply

Pranav Lal says


2020-02-16 at 16:24

Hi Tim, I am stuck with an NVIDIA GeForce GTX 1660. Do I stand a chance with this model of GPU, or do I need to buy something else? The problem is that the RAM is only 6GB, but I cannot afford anything more.

Reply

Tim Dettmers says


2020-04-03 at 19:14

It will be difficult but you can look up techniques to conserve memory. You will
probably also need to accept running smaller datasets and models.

Reply

Aaric says
2020-01-24 at 05:44

If I am to start out building supervised and unsupervised models, then do I really


need a graphics heavy computer?

Reply

Tim Dettmers says


2020-02-26 at 10:09

Yes, usually even if you just want to get started a GPU is required. A CPU can
be quite slow even for small problems.

Reply

Satchel says
2020-01-18 at 19:06

Hi Tim! I’m an undergraduate student primarily focused on Kaggle competitions


and personal projects – I have an old PC (i7 870, 16GB DDR3) that I plan on upgrading with a 1070/1070 Ti GPU and some new-age SSDs.

The CPU fits your requirements (8 threads, 1 GPU) but is more than 10 years old and
only supports PCIe 2.0×16. How significant would this bottleneck be in your
opinion, and does it warrant an upgrade to a modern CPU?

Finally, I’m curious about your opinion on AMD Cards / ROCm stack, as I have
access to a R9 290 and Vega 54.

Thanks for an insightful article,


Satchel

Reply

Tim Dettmers says


2020-02-26 at 10:09

I believe PCIe 2.0 would be sufficient in your case, but there is not enough data to say that definitively. I would give it a try and upgrade your computer if it does not work out. Theoretically it should be fine.


I would avoid the AMD cards still due to compatibility with software.
Reply

Satchel says
2020-11-22 at 19:54

Thanks for your advice! I ended up getting a 1070 and it worked pretty decently (a little faster than Colab). Since then I actually entered a master's degree, and I'm using it quite a bit more often.

I've noticed when training that both my GPU and CPU usage are around 98%-100% (occasionally GPU at 94%). Sometimes either one may be slightly higher than the other, but it tends to be more CPU-bound (98% CPU, 94% GPU, sometimes 100% on GPU).

To me this is a good signal that an upgrade is necessary, but do you think it's a significant bottleneck? I'm curious if upgrading my CPU will dramatically change performance.

Reply

Tim Dettmers says


2020-11-22 at 22:40

The high CPU percentages do not necessarily mean that your CPU is fully utilized. Some libraries use active waiting, which will keep the CPU busy with "empty" calculations. The GPU utilization is also not the true utilization; it just means that the GPU's cores are being used (but not by how much).

One test you can use to decide on a CPU upgrade is to limit the CPU frequency manually. This can be done with some CPUs on Linux (for Intel it is easy; for AMD I am not sure, but it should be possible). Then you can compare the performance with an underclocked CPU. If it is much lower, then the CPU is a bottleneck. If the performance is similar, the CPU is not a bottleneck.
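A simpler alternative check than underclocking (a sketch, not the procedure described above): time training steps per second once with freshly produced batches and once with a single cached batch reused over and over. If both rates are similar, the CPU-side data pipeline is not the bottleneck; a large gap means it is. The model and synthetic data below are placeholders; substitute your own model and DataLoader.

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def steps_per_second(batches):
    if device == "cuda":
        torch.cuda.synchronize()
    start, n = time.time(), 0
    for x, y in batches:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
        n += 1
    if device == "cuda":
        torch.cuda.synchronize()
    return n / (time.time() - start)

def fresh_batches(n):  # stands in for your real DataLoader / preprocessing
    for _ in range(n):
        yield torch.randn(256, 1024), torch.randint(0, 10, (256,))

cached = [(torch.randn(256, 1024), torch.randint(0, 10, (256,)))] * 100
print("fresh batches :", steps_per_second(fresh_batches(100)), "steps/s")
print("cached batch  :", steps_per_second(cached), "steps/s")
```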

Reply


Michel Gartner says


2019-12-25 at 15:49

Hi Tim,

I have doubts about choosing a CPU. I know the Ryzen 7 3700X or something like that seems pretty good, but I'm worried about the Intel MKL library issues.

I will buy an RTX 2070 Super.

Which CPU do you recommend? Ryzen 7 3700X? i9 9900K? i7 9700K? As far as I know, the i9 and i7 only have 16 PCIe lanes and that could be a problem.

I mostly do deep learning stuff, but I also want to use my PC for some Kaggle competitions (mostly tree-based models that run on the CPU in sklearn).

Regards,

Michel

Reply

Tim Dettmers says


2020-02-26 at 10:06

MKL library issues are only for things like solvers, Fourier transform,
eigendecomposition. I am not sure if that is really that common for Kaggle
competitions and you would only be hit by a small penalty. I think Ryzen
processors are fine even in your case.

Reply

Victor Gorgonho says


2019-12-19 at 16:13

Hey Tim,
Could you please tell me which processor you think will fit better with this PC I'm planning to build:


NVIDIA GEFORCE RTX 2070 8GB


2x 8 GB 2666MHz Vulcan DDR4
B450 Asus Prime
HD 2TB 7200 + SSD 500GB M.2
650w PCYes Shocker 80 plus

My options right now would be either Ryzen 7 2700 or Ryzen 5 3600…

Also, is that good enough for a deep learning student? The lab I'm working in uses computer vision to recognize LIBRAS, which is the Brazilian Sign Language…

Appreciate, and loved your article!


Regards,
Reply

Tim Dettmers says


2019-12-24 at 07:17

Both CPUs are more than fine for one GPU. I might just go with the cheaper
one.

Reply

Eric Bohn says


2019-12-16 at 09:34

For real world applications in a 2 GPU system running RTX 2080 Ti’s at most, is
there much difference between x8/x8 and x16/x4? Does it effectively make both
perform as a x4 to keep things in sync when running model/data parallelism?

Reply

Tim Dettmers says


2019-12-16 at 18:06

Both perform at about 4x speed.


Reply

Steven says
2019-12-15 at 18:08

Hi Tim,

I was wondering if you had any opinion on cube computer cases. I'm thinking of exchanging my full tower case for a cube case to save space, but can't find anything that lets me know what the cost (if any) for cooling my GPU (possibly expanding to 2 GPUs) would be. Do you have any knowledge or opinions?…

Thanks

Reply

Eric Bohn says


2019-12-13 at 14:48

Are 2x GPU for machine learning worth it? Should I buy a board now that allows for
2x x8/x8, or upgrade to Threadripper for multi-GPU later on?

Reply

Tim Dettmers says


2019-12-13 at 16:23

More GPUs are always better :). If you plan to go for 4 GPUs in the future, it
makes sense to get the Threadripper and the right motherboard right away. But
then you should ask yourself, do you really need 4 GPUs / is spending that
money justified?

Reply

Eric Bohn says


2019-12-13 at 17:25

My wife isn’t going to like that answer

On average, what kind of performance improvement can I expect when


going from one to two GPUs?

Reply

Atharva says
2019-12-13 at 01:24

Hey Tim,
Please tell me what you think of this PC:
i7-9700K processor (3.6 GHz, 12 MB)
Hard Drive 2TB 7200+1TB SSD M.2
NVIDIA GEFORCE RTX 2070 8GB
will this be good for deep learning?
BTW loved your article!
Regards,

Reply

Tim Dettmers says


2019-12-13 at 16:25

I think that is appropriate. I also like Ryzen CPUs if you want to save a bit of
money. They would definitely also be more than enough for an RTX 2070 GPU.

Reply

Xiaopeng Fu says
2019-12-11 at 22:11

Hi Tim,


Many thanks for the wonderful article and all the replies to our queries. Being new to this field and preparing to build my own system, I'm wondering if it's worth waiting for Intel's 10th-gen desktop CPUs. Is it a good idea, for example, to buy a 9900K + Z390 board at this moment, knowing the board will not be compatible with future CPU upgrades? Or maybe the 10th-gen improvement will not make much difference for DL…

Thanks!

Xiaopeng
Reply

Tim Dettmers says


2019-12-13 at 16:28

The CPU does not matter that much for deep learning. If you have some workloads which require a better CPU (factorization, sklearn models, some big data stuff) then it might well be worth it to wait. However, if you just want to get started and do deep learning it might be better to just go ahead now — you will lose almost no deep learning performance if you use a 9900K CPU.

Reply

Eric Bohn says


2019-12-09 at 12:16

Hey Tim, If you have a moment I’d be curious to know what you think about my
build and my reasoning behind it: https://pcpartpicker.com/b/3jw6Mp

Much appreciated.

Eric

Reply

Tim Dettmers says


2019-12-10 at 16:54

I think it looks good. Two things though: the PSU with that high a wattage is only needed if you want to expand to two GPUs in the future — think again if that is really what you want. Otherwise, I would use two of the same NVMe SSDs instead of a small one and a larger one. I guess you want to store the OS on the smaller one and have the rest for data? The better approach is to use a small partition for the OS and then use a virtual RAID 0 to create a single high-speed device — this can make a huge difference if you work with very large datasets! Otherwise, that is quite a few spinning disks, but if you need the space, then you need the space.
Reply

Eric Bohn says


2019-12-10 at 17:13

Thanks Tim.

Yes, the original thought behind the 850W PSU was for two GPUs. That’s
also why I have that motherboard – for the x8/x8 configuration. Do you
think it’s reasonable to run two GPUs in this setup, or should I plan on
moving to a threadripper system when I want to go for the second GPU?
My intent for this machine is personal projects and a Masters program, but
I also want to be open to scaling for a small startup and/or consulting.

I’m not following on the virtual RAID 0 configuration. Do you mean run
Windows and Linux on the same drive on two partitions, and RAID 0 with
the second physical drive? Do you have a link?

Is there any reason to not go with a single 1 TB drive for OS and datasets vs
a drive for OS and another drive for datasets?

Reply

Tim Dettmers says


2019-12-13 at 16:30

I think startup stuff and consulting is fair with 2 GPUs. You want to get GPUs with big memory though if you want to do startup stuff, preferably a Titan RTX. On Linux a virtual RAID 0 is easy to set up: https://www.digitalocean.com/community/tutorials/how-to-create-raid-arrays-with-mdadm-on-ubuntu-16-04. Not sure about Windows though.
Reply


Juliana says
2019-12-02 at 08:13

Hi Tim, Thank you so much for you work and for your helpful guides.

I was wondering if you would mind looking at my build project and helping me with
a doubt I have (regarding the CPU/motherboard combination).

I’ll use it mostly as a machine-learning/ deep-learning/ computer-vision


workstation. My mind is set on a RTX2080Ti (possibly adding a second RTX2080Ti in
the future).
Here’s the [PCPartPicker Part List](https://es.pcpartpicker.com/list/mQ46rV).

Picking the motherboard, my reasoning was:


– the 2-way SLI capability allows me to eventually add a second RTX2080Ti later in
a x8/x8/x4 PCIe configuration.
– with the X570 chipset, I can get the Zen2 microarchitecture of the Ryzen 3000
series without a bios flashing headache (even though the Graphic cards won’t take
advantage of the PCIe Gen 4).
All this results in a somewhat high-end motherboard.

The biggest concern I have is the motherboard/CPU combination: it feels weird to


spend less on a CPU than on its motherboard. Isn’t this a bit ‘Frankensteinish’?

There is a 110 euro gap in Spain between the 6-core Ryzen 5 3600 and the 8-core Ryzen 7 3700X.
Isn't this a bit wasteful for a Ryzen 5 3600? Should I go for a Ryzen 7 instead?

Reply

Tim Dettmers says


2019-12-06 at 10:05

The CPU/motherboard combination looks pretty good to me for 2 GPUs. It is totally fine if you spend more on a motherboard than on a CPU. Over the past years, motherboards kept getting more expensive and some CPUs, especially AMD ones, got cheaper. However, the Ryzen 7 can make sense if you are working with datasets that involve loading and preprocessing lots of data (computer vision, for example ImageNet).


Reply

Michael says
2019-11-24 at 15:45

Hi Tim, which GPUs would you get if you had $10k and wanted to use them to train large transformer-based models at home? Note that at home you would have to pay for electricity yourself.
I'm trying to decide between 8x 2080 Ti vs 4x RTX Titan vs 2x Quadro 8000. Also note that four RTX Titan cards in the same chassis will overheat due to their fan type, and I'm not very comfortable water cooling them.

Reply

Michael says
2019-12-06 at 10:16

No thoughts? I'm seriously thinking of getting 4 RTX Titans and water cooling them, but it's a bit scary.

Reply

Hugo says
2019-11-22 at 10:54

Hi Tim,
Congratulations on your great work.

What setup would you recommend for the latest release of GPT-2 (the pre-trained language model, 1.5B parameters)?
I intend to train this model for my research, but I am quite unsure about the hardware needed. I have read that numerous users have issues even with quite powerful setups.
Any idea?
Sorry for my English, it is not my mother tongue.
Best regards.


Reply

Tim Dettmers says


2019-11-24 at 16:03

A minimum would be 4x RTX 2080 Ti. You might use very small batch sizes
though which is computationally inefficient, thus I would not recommend RTX
2080 Tis. I would recommend instead 4x Titan RTX which should have enough
memory so you can run GPT-2 and other transformers with a large enough
batch size.

Reply

Hugo says
2019-12-28 at 02:09

Thank you very much for your answer, Tim!

You opened my eyes to an important point regarding transformers. I am now studying the matter of batch sizes.
Better to start with a Colab notebook and run tests before investing big money…

Reply

Filip says
2019-11-13 at 15:19

Tim, your hardware guide was really useful in identifying a deep learning machine
for me about 9 months ago. At that time the RTX2070s had started appearing in
gaming machines. Based on your info about the great value of the RTX2070s and
FP16 capability I saw that a gaming machine was a realistic cost-effective choice for
a small deep learning machine (1 gpu).
I ended up buying a Windows gaming machine with an RTX2070 for just a bit over
$1000. I ended up modifying the cooling to get positive case pressure (took off the front bezel blocking the airflow) and making it a dual-boot Windows 10 / Ubuntu 18 machine. As a Linux newbie, one gotcha I found was that using a Windows file system results in a performance bottleneck in Linux. So I added an SSD with ext4 for data preprocessing and that made a big difference.

It has been working great for learning deep learning (with pytorch) and Kaggle
competitions. I have found this local setup to be faster than Google Colab, Kaggle
kernels, and Azure notebooks and long runs are more reliable. The colorful case
lights are an added bonus!
Reply

Tim Dettmers says


2019-11-24 at 15:38

Thanks for your feedback! I think cases like this are pretty common. Some setups will fall a bit short here and there, but with a bit of adjustment you can quickly get a great system that fulfills most of your needs.

Reply

Anu Chandra says


2019-11-09 at 10:47

Hi Tim,

Thanks for the excellent material. I’ve been working with a 4x2080Ti workstation.
Some of the new GAN training work really requires 8x2080Ti. I’ve been looking at
server based reference designs – deeplearning11 and deeplearning12 from
servethehome.com. I don't know a lot about servers, but it seems (from YouTube videos) that they generate horrible fan noise when all GPUs are used. Have you given any thought to an 8x GPU machine that can live comfortably in a home environment? Any thoughts appreciated. Anu

Reply

Tim Dettmers says


2019-11-24 at 15:37

I do not think you will find an 8 GPU machine which you can comfortably house in a home. If you have a small room far from other rooms (bedroom/living room) you might be able to do it if you put some noise insulation into the room and put the server there. It might just be a better idea to rent some GPUs/TPUs in the cloud for whenever you need to run 8 GPU jobs. You can get a 4 GPU machine for most other things and only use the 8 GPUs if you need them. Or do batch aggregation to simulate 8 GPU training. Batch aggregation will just double the training time for you, which should be alright and is doable in a home environment.
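"Batch aggregation" here is what is usually called gradient accumulation. Below is a minimal PyTorch sketch with a toy model and synthetic data as placeholders for your own training loop:

```python
import torch
import torch.nn as nn

# Gradient accumulation: sum gradients over several small batches before each
# optimizer step, so one GPU mimics the effective batch size of a multi-GPU run
# at the cost of proportionally longer training time.
model = nn.Linear(100, 10)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accumulation_steps = 8  # e.g. simulate 8 GPUs' worth of batch size
micro_batches = [(torch.randn(16, 100), torch.randint(0, 10, (16,))) for _ in range(64)]

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    loss = loss_fn(model(x), y)
    (loss / accumulation_steps).backward()  # scale so accumulated gradients average
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```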
Reply

Mike says
2019-11-05 at 19:43

Dear Tim,

Hello, I am going to do a UG project which involves deep learning on medical imaging. I am looking for a laptop that will let me do the research. Do you have a laptop recommendation?

Reply

Tim Dettmers says


2019-11-24 at 15:33

I would not recommend laptops for medical imaging deep learning projects. Usually, in medical imaging you will have images with very high resolution and you will need a GPU which has the most memory that you can afford (Titan RTX, 24GB). You can buy a desktop and a small laptop with which you log in to the desktop when you are on the go. This would be the best solution.

Reply

vithin says
2019-10-17 at 07:50

Can I use multiple different GPU cards on a single CPU for deep learning?

Reply

Tim Dettmers says


2019-10-21 at 18:29

Yes, but you will not be able to parallelize across those GPUs.

Reply

Alex says
2019-10-07 at 12:10

Hi Tim,

Thank you for the post!

I am thinking of building a PC and right now my build is:


CPU – AMD Threadripper 1900X 3.8 GHz 8-Core Processor
CPU Cooler – Corsair H100i PRO 75 CFM Liquid CPU Cooler
Motherboard – Gigabyte X399 AORUS PRO ATX TR4 Motherboard
Memory – Corsair Vengeance LPX 32 GB (2 x 16 GB) DDR4-3200 Memory
Storage – Samsung 970 Evo Plus 1 TB M.2-2280 NVME Solid State Drive
GPU – Asus GeForce RTX 2070 SUPER 8 GB Turbo EVO
Case – NZXT H500i ATX Mid Tower Case
PSU – Corsair RMx (2018) 850 W 80+ Gold Certified Fully Modular ATX Power
Supply

I am going to use it for training CNNs (Kaggle, not large projects). I'm also planning to add a second GPU after some time. Is this build sufficient for my purposes? Would you recommend a different processor? I'm a bit worried about the cooling: is it enough for the current build, and does something need to be changed for 2 GPUs?

Reply

sakshi says
2019-10-01 at 00:08

I love your blog. The knowledge provided here is more unique than on other deep learning blogs.
Well explained, keep updating.

Reply


Michiel van de Steeg says


2019-09-30 at 10:06

Hi Tim,

I’m looking to build a machine with one 2080 TI, with the ability to expand it to a
second card. The difficulty I’m facing is that I want my machine to be quiet, but
watercooling is quite complex, expensive, and seems like high maintenance. Blower
fans aren’t particularly quiet.

Do you know if (on an x470 mobo) there are ways other than watercooling and two
blower fans that would keep the case cool enough? For example, how much would
hybrid cooling AIOs help? Say two of
these: https://www.evga.com/products/product.aspx?pn=11G-P4-2384-KR (bonus
points if you know whether any motherboard has headers for two aio pumps)? Or,
if I get an air cooled GPU now, would adding a blower fan to that allow for
sufficient cooling?

Hoping you can offer any advice on this issue. I’ve searched the internet but I think
my demands may be a bit too high / specific.

Reply

Tim Dettmers says


2019-09-30 at 10:43

If you just want to run two cards you can get a motherboard with at least 3 PCIe slots and use non-blower fans. Because you then have a single empty PCIe slot between the cards, cooling is usually sufficient and you can run somewhat quieter non-blower fans. Otherwise, AIOs can help. People have mixed reviews about them: some report very low temperatures, others report temperatures similar to regular fans. I think you cannot do much wrong with AIO GPUs if you want silent performance; the 3x+ PCIe slot + non-blower fan option is cheaper though.

Reply

Michiel van de Steeg says


2019-10-01 at 12:01

Thanks for the quick response!


What’s not quite clear to me, though: AM4 compatible motherboards / ATX
towers don’t seem like they would support this in terms of physical
dimensions. E.g. on the Prime Asus x470 Pro, there are 2 GPU slots with 3-
slot-width space, and the third slot only has a 1-slot-width space. I’m not
sure how I can manage to put a graphics card in the bottom slot. In many
cases the PSU or bottom of the case would be in the way, and most don’t
have the right number of expansion slots on the back. Am I overlooking
something?

Do you think a hybrid cooled 2-width GPU on the first PCIEx16 and an air
cooled 3-width GPU on the second PCIEx16 could work? That would mean
there’s 1 slot of space in between the two, with part of the heat from the
top one going to the radiator.

Thanks again!
Reply

Tim Dettmers says


2019-10-01 at 13:16

If you have the right case you can install a GPU in the bottom slot. It only has a 1-slot width, but in some computer cases the GPU just extends beyond the motherboard. If you look for cases that are optimized for GPU airflow you can probably find a usable case.

Not sure if the hybrid + regular fan would work out.

Reply

Ahmad says
2019-09-30 at 02:35

Dear Tim Dettmers,


Thank you very much for this blog. This information is really useful for an upcoming deep learning project. In my workplace we are developing a server-like system to run three deep learning projects. To run all the models concurrently, 24×7, I need nearly 100 GB of RTX 2080 Ti GPU memory. To maintain these GPUs, what type of additional resources do I need? I need the GPUs only for inference, not for training.


My question may look a little broad sorry for that. If you need any other
information please let me know.

Thank you in advance.


Reply

Tim Dettmers says


2019-09-30 at 10:39

If you require a large amount of memory (to hold different kinds of models?) and only want to do inference, then a CPU might actually be an excellent option. For inference, in general, the software will be far more important than the hardware. I think in terms of memory per dollar one of the best options will be the RTX 2080 Ti or the RTX 2060 — but I am not sure if memory is really your problem.

Reply

chanhyuk jung says


2019-09-29 at 07:16

I bought an RTX 2060 Super. And I have a system with an i5 3470. I added RAM so it's 16 GB, and it has an SSD and an HDD. With the 3470, only two cores are at 100%. I can upgrade to an i5 8500, but would it make a lot of difference?

Reply

Tim Dettmers says


2019-09-29 at 08:39

If you are using PyTorch, try setting OMP_NUM_THREADS=1, i.e. run OMP_NUM_THREADS=1 python your_script.py. You can also try this if you are using TensorFlow, but I am not sure if it will help. The difference is mainly determined by your scripts. If they load a lot of data you can benefit quite a bit from a good CPU. You can also tweak the number of data loader threads and see how big the difference is for that. Sometimes you can squeeze a bit more out of your CPU if you tune that parameter. Depending on your script you can probably expect 0-30% increased speed with an i5 8500.
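A minimal sketch of the two knobs mentioned in this reply (assuming PyTorch and a synthetic dataset; on a real image pipeline the best num_workers value depends on how expensive decoding and augmentation are and how many CPU cores you have):

```python
import os
os.environ["OMP_NUM_THREADS"] = "1"  # set before heavy libraries spawn their thread pools

import time
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.set_num_threads(1)  # in-process equivalent of the environment variable

data = TensorDataset(torch.randn(20000, 3 * 64 * 64), torch.randint(0, 10, (20000,)))

# Sweep the number of data-loader worker processes and time one pass.
for workers in (0, 2, 4, 8):
    loader = DataLoader(data, batch_size=256, shuffle=True, num_workers=workers)
    start = time.time()
    for _ in loader:
        pass
    print(f"num_workers={workers}: {time.time() - start:.2f}s per loading epoch")
```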


Reply

Carl says
2019-09-15 at 13:33

Hi, I am considering buying a GPU for deep learning. If I understand this article right, there are different models of the RTX 2070 card. I am looking for 16-bit FP, but I don't see any information about FP. Could you tell me which parameter in the specification I should pay attention to?

Reply

Tim Dettmers says


2019-09-16 at 14:29

Sorry for the confusion, but all RTX 2070 have 16-bit capability. You can pick
any card. If you have the money though, I would recommend picking the RTX
2070 Super over the regular one.

Reply

Fatih Bilgin says


2019-09-06 at 00:46

Hi Tim,

I'm trying to set up the following PC based on your guide (especially for Kaggle beginner competitions). Does it look good? Thank you.

*Asus Turbo GeForce RTX 2070 8GB 256Bit GDDR6 (DX12) PCI-E 3.0 GPU (TURBO-
RTX2070-8G)
*AMD RYZEN 5 2600X 6-Core 3.6 GHz (4.2 GHz Max Boost) Socket AM4 95W CPU
*ASUS TUF X470-PLUS GAMING AMD X470 AM4 Ryzen DDR4 3200MHz(OC) M.2
USB3.1 mOTHERBOARD
*Crucial 32GB (2x16GB) Ballistix Sport LT Gray DDR4 3000MHz CL15 1.35V PC Ram
*Corsair TX-M Series TX850M 80+ Gold PSU CP-9020130-EU (850W)
*Intel 660P 1TB 1800MB-1800MB/s NVMe M.2 QLC SSD


*Seagate Barracuda 2TB 2.5″ 5400RPM 128MB Cache Sata 3 HDD


*Thermaltake Level 20 MT ARGB CA-1M7-00M1WN-00 Black SPCC ATX Mid Tower
Computer Case
Reply

BrN says
2019-09-04 at 15:59

Hey Tim, really appreciate your post here. Has been a huge help. I’m currently
doing some deep learning application on MRI images using mostly
Tensorflow/Keras. I’d like to build a workstation with a 1 GPU set up for now with the
plan to up it to 2 GPUs in the future. I don’t think I’ll be going to the 4 GPU set up.

Wanted to get your thoughts on the CPU/PCIe lane situation (here is my


build: https://pcpartpicker.com/list/fvWZBb). I’ve had some people suggest I go the
AMD 3900x route to take advantage of quad channel memory and more PCIe
lanes, but wondering if the current set up would be good enough for a 2 GPU set
up in the future (i.e. add another RTX 2080 Ti sometime later).

Thanks!

Reply

Tim Dettmers says


2019-09-11 at 09:14

You might want to have a slightly bigger PSU if you want to run two GPUs. The
extra PCIe lanes are not worth it.

Reply

Winston Fan says


2019-09-02 at 17:13

Thank you for this great article~! It helps a lot! I also found this article (https://medium.com/the-mission/how-to-build-the-perfect-deep-learning-computer-and-save-thousands-of-dollars-9ec3b2eb4ce2) which recommends the AMD Threadripper 2920X, and here is the build (https://pcpartpicker.com/b/nrjypg). But I went to UserBenchmark and found that the Ryzen 7 3700X is actually newer, cheaper, and performs better in most aspects:

https://cpu.userbenchmark.com/Compare/AMD-Ryzen-TR-2920X-vs-AMD-Ryzen-7-3700X/m625966vs4043

My question is, should I keep everything the same but just replace the Threadripper 2920X with a Ryzen 7 3700X? Or should I stick to the TR 2920X? And why?
Thank you!
Reply

Tim Dettmers says


2019-09-11 at 09:17

Make sure the Ryzen CPU supports the number of GPUs that you want to have.
If it does it is a great and cheap option!

Reply

Winston says
2019-09-11 at 19:15

thanks. I did my research and found that Ryzen 7 3700X supports up to 2


GPUs, which is fine for me.

So Ryzen 7 3700X is definitely a better choice for me

Reply

Sourav Banerjee says


2019-08-29 at 21:47

Hey Tim,

First of all, thank you very much for your wonderful article with great insight into deep learning. It helped me to a certain extent to understand the hardware requirements needed for any DL machine. But I have an existing system with the following config:


CPU: Intel Core i5-2500K


HDD: Seagate ST3160812AS 41N3268 LEN 160GB
HDD: Seagate Barracuda 7200.12 500GB
RAM: Corsair XMS3 DDR3 1600 C9 2x2GB
MBD: Asus P8Z68-V
GPU: ? What should I go for as an optimal upgrade for DL and NLP?

By the way, can this CPU perform well for ML or do we need to rebuild the PC?
Reply

Tim Dettmers says


2019-09-11 at 09:18

The CPU will be fine if you use 1-2 GPUs. If you have more you need something
better.

Reply

Sourav Banerjee says


2019-09-11 at 13:02

Which GPU do you suggest? Do I need to upgrade my RAM?

Reply

Claus says
2019-08-28 at 09:45

Thank you! This is extremely helpful. I need to buy a multi-GPU setup suited for deep learning analysis of 3D radiological data – eventually several TB. What kind of a setup would be recommended if I have about 30k€ available? Several of your comments relate to smaller systems, so what are the key caveats for larger systems like the one I need to buy?
A more specific question relates to the GeForce 2080 Ti versus the Tesla V100 (10x the price!). Any killer argument for the Tesla? More VRAM than 11GB is needed in our case.

Reply


Tim Dettmers says


2019-09-11 at 09:20

I would recommend an 8 GPU machine with 8x RTX Titans. Reach out to some hardware vendors that offer these systems. It might be that for such a machine the budget you need is slightly higher (32k euro). If that is the case, a 4 GPU machine with 4 RTX Titans is also great. The RTX 2080 Ti has too little memory for your application. The V100 is too pricey and not good!
Reply

Claus says
2019-09-12 at 07:42

Thank you very much indeed for your advice. In the meantime my American colleague suggested going for the V100, despite the price, with the following argument:
GeForce cards are totally fine for 2D models, in particular if you want to leverage ImageNet transfer learning by cropping or resizing (the first is better) to 224×224. However, NVIDIA's advanced tools, for example AMP (automatic mixed precision), might not be available on GeForce and work just on the V100. AMP allows you to train deeper models or a larger training batch (faster training) with a limited memory footprint. If you are planning 3D data-driven models or multi-channel models (information from different sequences) I would definitely choose V100 32 GB cards.
If we follow this advice we could only start with V100 GPUs and buy more at a later point in time. Do you have a comment, and could you elaborate on what you meant when stating the V100 is … and not good? Thank you once again!

Reply

Michael says
2019-09-13 at 17:27

The Titan RTX is slightly slower than the V100, and has 24GB of RAM (vs 32GB in the V100). It supports all the features of the V100 (including AMP). Your colleague does not know what he's talking about. If you're absolutely sure that your model is not going to fit in 24GB of RAM even with a batch size of 1, then I recommend going with four Quadro RTX 8000 cards (48GB of RAM at $5,500).
Here's a good system builder in the US:
https://lambdalabs.com/products/blade


Reply

Tim Dettmers says


2019-09-16 at 14:31

If you think memory is a problem I suggest going with Quadro 8000


cards with 48 GB memory instead.

Reply

Lina says
2020-05-19 at 08:32

Hi Tim,
Thanks for your great article! I have some questions, can you please answer them?!
What about using multiple Titan RTXs instead of multiple Quadro 5000s? Which one will be faster? Also, I found that Lambda uses Quadro and Tesla cards instead of the Titan RTX for their DL servers; what is the point? Is that just for double precision?
Thanks!

Tim Dettmers says


2020-07-03 at 07:24

Both are about the same. Lambda uses Quadro because they make more profit. It also might be that NVIDIA does not sell them RTX cards anymore. There is a clause in the NVIDIA GeForce driver license that forbids the use of these cards in data centers, so this could also be a reason.


RB says
2019-08-27 at 15:38

Hi,

I am looking to do some entry-level DL stuff and then build my way up to Kaggle. I would appreciate any feedback on the following machine.

https://pcpartpicker.com/list/GxyQCb

Thanks in advance!

Reply

Francisco Paiva says


2019-08-23 at 06:06

Let's say I decide to go with an Intel i9-9900KF, which has only 16 PCIe lanes available. In this scenario, I also use two GeForce RTX 2080 Ti GPUs. If I also use an SSD, which requires 4 PCIe lanes, would I still be able to go for an x8/x8 setting with the GPUs? Wouldn't the system configuration be limited by the max number of CPU PCIe lanes, and so, considering the SSD, the GPUs would be forced to an x8/x4 setting? In this scenario, would it be better to get just one GPU if I intend to use parallelism?

Reply

Tim Dettmers says


2019-09-11 at 09:21

Yes this is problematic. You can use a SATA SSD to solve this or another CPU.

Reply

Francisco Paiva says


2019-09-11 at 09:30

Ok! Thank you for the reply =)


Reply

Eric Bohn says


2019-12-08 at 20:18

Doesn’t this depend? The chipset has PCIe lanes in addition to the CPU
right? Therefore, if the m.2 is on the chipset then it wouldn’t take away
from the x8/x8 used by the GPUs?

Reply

Tim Dettmers says


2019-12-09 at 09:40

Yes, you are right. I got it wrong the first time around. Most often your
motherboard will provide the PCIe lanes for the PCIe storage and thus
it does not take away from the GPU PCIe lanes.

Reply

Eric Bohn says


2019-12-09 at 11:25

Thank you for clarifying.

lazy_propogator says
2019-08-17 at 21:51

Hello Tim, this is a great article! Thanks for all the info. NVIDIA recently released the Super versions of the RTX cards, can you shed some insight on that? They are supposed to be more powerful than their predecessors; it's said that the RTX 2060 Super is almost as good as the RTX 2070. But on the other hand there are reports that the RTX 2080 Super is only slightly better than the RTX 2080. Can you shed some light on this?

Thanks
Reply

Tim Dettmers says


2019-09-11 at 09:24

I have not analyzed the data of the GPUs yet. What you say seems accurate
from my first impression though. So RTX 2070 and 2060 Super are good. RTX
2080 Super not so much.

Reply

Steven says
2019-08-17 at 12:01

Hi Tim,

Thank you for sharing! I’m looking to build a desktop for prototyping. What are
your opinions on Intel i5 9600k vs i7 9700k? Or do you recommend something
else? Also can you recommend a good compatible motherboard? I will be using
one RTX 2070 for now but would like to be future proof for up to 4 one day.

Thanks!

Reply

Tim Dettmers says


2019-09-11 at 09:25

If you want to have 4 GPUs consider a CPU with at least 32 lanes and about 6-
8 cores.

Reply


Nick Jonas says


2019-08-14 at 20:26

Will this build provide enough airflow for a 9900K + 2080 Ti? It’s all air cooled, 2080
Ti model has an open-air design with two fans (can buy 3 fans if needed).

PCPartPicker Part List: https://pcpartpicker.com/list/w7rRfH

CPU: Intel Core i9-9900K 3.6 GHz 8-Core Processor (Purchased For $485.00)
CPU Cooler: Noctua NH-D15 82.5 CFM CPU Cooler (Purchased For $89.95)
Motherboard: Gigabyte Z390 AORUS PRO ATX LGA1151 Motherboard (Purchased
For $144.99)
Memory: Corsair Vengeance LPX 16 GB (2 x 8 GB) DDR4-3200 Memory ($84.99 @
Amazon)
Storage: Samsung 970 Evo Plus 500 GB M.2-2280 NVME Solid State Drive
(Purchased For $109.99)
Storage: Seagate Barracuda Compute 2 TB 3.5″ 7200RPM Internal Hard Drive
($54.99 @ Amazon)
Video Card: Zotac GeForce RTX 2080 Ti 11 GB AMP MAXX Video Card ($1099.99 @
Amazon)
Case: Cooler Master MasterCase H500 ATX Mid Tower Case ($99.99 @ B&H)
Power Supply: Corsair RMx (2018) 850 W 80+ Gold Certified Fully Modular ATX
Power Supply (Purchased For $94.49)

Reply

Tim Dettmers says


2019-09-11 at 09:26

Single-GPU builds usually have no cooling issues. Airflow is not that critical. It is
more about what kind of cooling system you have on the GPU.

Reply

Bill says
2019-07-31 at 11:21

Follow-up to my question: here is the config I’m looking at. PCPartPicker doesn’t seem fully
current; the Asus Rampage IV is listed but not the V.


Reply

Bill says
2019-07-31 at 10:28

Any advantage of EEB over EATX, e.g. these two?

Does the ‘WS’ run 3 GPUs at x16, or only one? It seems to have a max of 3 GPUs, vs.
4 for the ROG. PCPartPicker lists the ROG IV (no V) without sellers, and the price of
the IV is $700+ on Newegg.

ASUS WS C621E Sage EEB Server Motherboard Dual LGA 3647 Intel C621
3 x PCIe 3.0 x16 (x16 mode)
2 x PCIe 3.0 x16 (Single at x16, dual at x8/x8)
2 x PCIe 3.0 x16 (x8 mode)

ASUS ROG RAMPAGE V EDITION 10 LGA 2011-v3 Intel X99 SATA 6Gb/s USB 3.1
Extended ATX Motherboards
4 x PCIe 3.0/2.0 x16 (x16, x16/x16, x16/x8/x8, x16/x8/x8/x8 or x8/x8/x8/x8 mode with
40-LANE CPU; x16, x16/x8 or x8/x8/x8 mode with 28-LANE CPU) *
* The PCIEx8_4 slot shares bandwidth with M.2 and U.2.

Reply

Gowri says
2019-07-03 at 20:22

Hello Tim,
Thanks so much for the blog and replies to comments. I am sorry if this is a
reposting, but my comment seemed to have disappeared, so thought I would post
again… It would be so helpful to have your insights.

I am attempting to put together a desktop with what I have available online and
locally, that is both DL-now ready and future proof. These are the components with
some questions:

1. NVIDIA RTX 2080 (8GB)


(not sure which one is right, there seem to be multiple versions online such as RTX
2080 Super and Twin x2 8GB – which would be most appropriate here?)
2. RAM – Corsair Vengeance DDR4 32GB
(2x16GB (instead of 4x8GB) seems to cost less – would this be alright?)
3. Intel i7 8700k processor


(Selected this inspired by TensorBook’s choice of processor for a laptop; not sure
this is the best for the configuration selected, please do let me know if there’s a
better option)
4. 1 TB SSD (Samsung 970 Evo?) NVMe
5. ASUS mother board (an appropriate one)
(Would the ROG-Strix-Gaming-Motherboard-802-11ac/dp/B07HCPLQ2H be good
for DL too?)
6. Power supply – Corsair smps cx750
(we have occasional power cuts, so thought this is a worthy investment)
7. Hard disk for data (Seagate 2TB Fire Cuda)
8. Cabinet – Corsair Crystal 570x RGB 3 RGB fans
(Not sure if Mid Tower is sufficient for the config selected – is there a better option?)

Please would you share your inputs on these?

Thanks a ton for your time and help!


Reply

Tim Dettmers says


2019-08-04 at 13:48

I think this looks reasonable. You could go with a cheaper AMD processor
(Ryzen) to save some money. 2x 16 GB are great. Looks good otherwise!

Reply

Dmitry says
2019-06-14 at 11:21

Hi Tim,

First of all thanks a lot for the post – it saved quite a bit of time for me.

I’ve got a bit oldish machine with i7-3770K (4 cores + hyperthreading) and 32Gb
DDR3 RAM which I’d like to start using for Deep Learning (for NLP tasks).

Looking at your other post I am thinking about getting one RTX 2080 Ti though not
sure if my CPU would become a bottleneck and I better go for cheaper GTX 1080 Ti
instead.

What would you think about this?


Unfortunately, most posts on the internet about this are written from a gaming perspective and
do not look too relevant…

Many thanks,
Dmitry
Reply

Tim Dettmers says


2019-06-15 at 10:48

You should be fine with an i7-3770K for most tasks. Some tasks that make
heavy use of background data loaders, such as computer vision, can take a hit in
performance, but it should not be too much, maybe 30-50%. Compared to
getting a full new system, sticking with your i7-3770K looks like a quite
cost-efficient solution. I would give it a go!

Reply

Dmitry says
2019-06-15 at 13:21

Thanks Tim! Much appreciated

Reply

Daniel says
2019-06-11 at 06:15

Hello,
my name is Daniel, I am a student and using for the first time the PyTorch library
with Cuda. I was trying to train a network and came across some problems, and
hope you could help me out.

The setup I am currently using might be a little unusual. I am using a NUC7i7BNH,


that has a Intel Core i7-7567U Processor @ 3.50 GHz and 8 GB of RAM.
Additionally I have an external GPU, a Nvidia Titan V built in a Asus Rog XG Station
2 and connected to my NUC through Thunderbolt.


I went through some PyTorch tutorials and had seemingly no problems with this
setup. Nevertheless, when I try to train a bigger network with a big image dataset,
the CPU runs constantly at 100% and the GPU only at 0-5%. I have been trying to
find out what the problem is. I checked several times that my code is actually using
Cuda, but the CPU is still running at 100% and making the training progress
extremely slow.

From what I have read, I suppose that it should be a CPU bottleneck problem, but
wanted to confirm. I also looked at the RAM usage and it seems to stay between
85-90% during the training. Maybe it has also something to do with the fact that I
am using an eGPU?

Thanks in advance!
Reply

Tim Dettmers says


2019-06-13 at 20:38

That sounds like an eGPU issue where the bottleneck is transferring the data
to the GPU batch by batch. One solution might be, if your dataset is not too large, to
transfer the entire dataset to your GPU. This will take some time, but once this
transfer is complete the CPU should no longer be a bottleneck since almost no
operations are executed on the CPU. If this does not solve the problem,
something else might be wrong. You can run the PyTorch profiler to find out
where the bottleneck comes from exactly.
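
As a rough illustration of both suggestions (a hedged sketch with made-up tensor sizes, not Daniel's actual code): copy the whole dataset to the GPU once, then profile a few iterations to see whether the remaining time goes to CUDA kernels or to CPU-side work.

import torch
import torch.nn as nn

device = torch.device("cuda")

# Hypothetical small dataset: if it fits in GPU memory, copy it once so the slow
# Thunderbolt link is only crossed for this single transfer, not on every batch.
inputs = torch.randn(10_000, 128).to(device)
targets = torch.randint(0, 10, (10_000,)).to(device)

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
loss_fn = nn.CrossEntropyLoss()

# Profile a few iterations; the table shows whether time goes to GPU kernels
# or to CPU-side work such as data handling.
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for i in range(10):
        batch = slice(i * 64, (i + 1) * 64)
        loss = loss_fn(model(inputs[batch]), targets[batch])
        loss.backward()
print(prof.key_averages().table(sort_by="cuda_time_total"))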

Reply

Daniyal Mujtaba says


2019-06-04 at 09:03

Hi, a great blog tbh and really helpful in deciding most of the system for DL, but
I still need advice on the GPU.

So I am going for an i7-4790K with 16 GB DDR3.

Now it's GPU time. Actually, I am new to DL and I don't have much of an idea which GPU would perform
better:
RTX 2060 6GB (I read somewhere I can use it in 16-bit mode to virtually get 12 GB
of memory for processing), or
1070 (a 1080 Ti is out of budget right now).

That is option one, with the future possibility of upgrading to a better CPU with DDR4
RAM.

Or I can go for a better CPU like a 6700K with 16 GB DDR4 and settle for a 1060 6GB, and in the near
future add another GPU or upgrade this one to a 2070 maybe,
but that upgrade might take more than a year.

So yeah, suggestions would be helpful.

And this machine won't be used only for DL, as I would game on it too.. so yeah.
Reply

Tim Dettmers says


2019-06-13 at 20:26

I would go with the RTX 2060. If you learn to use it well you should be able to
use most deep learning models.

Reply

Al says
2019-06-15 at 04:37

I have to disagree with your assessment on 16-bit fp capable cards. Most


DL frameworks are garbage as far as software goes, and support for 16-bit
is still crippled. “learn to use it well” in practice means you need to spend
much more than the card’s worth of man-hours over its lifetime to make
16-bit work in many cases, which completely defeats its purpose. I think
your performance graph is misleading. Pricing will get outdated very
quickly too, as Nvidia will screw us shortly with the new RTX “super” pricing
drop and second hand Pascal cards keep going down in price.

I just got a second-hand Pascal card myself after trying a Turing RTX for a
few months. More RAM for a lower price, a much better deal than a RTX
2060. I’m mostly stuck with 32-bit in most cases anyway. Maybe in a year
or two 16-bit will actually be usable!

Reply

Tim Dettmers says


2019-06-15 at 10:45


In PyTorch, the 16-bit recipe is quite easy to do and stable. I have no


16-bit experience with TensorFlow though.
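
For reference, here is a minimal mixed-precision training step. It uses the torch.cuda.amp API that ships with newer PyTorch releases (at the time of this comment the equivalent was NVIDIA's Apex library); the model and data below are made up.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()      # loss scaling guards against fp16 underflow

x = torch.randn(64, 512, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():           # forward pass runs in fp16 where it is safe
    loss = nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()             # backward pass on the scaled loss
scaler.step(optimizer)                    # unscales gradients, then steps
scaler.update()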

Reply

Al says
2019-06-29 at 08:50

From what I’ve seen in Keras some layers -which you’re prone to
end up needing to use like batch norm- don’t support 16-bit. Also,
the results are often out of whack, with big losses that don’t
improve and so on, even following the guidelines. When it works
it’s great but when it doesn’t you just can’t justify spending the
time to make it work.

Another annoying side effect of faster cards with the same amount
of RAM is that you may often find that you’re under-utilising the
compute capacity. So why use a “fast” RTX at 50-70% capacity
when you can use a cheaper GTX at 80-100% and get the same
results in the same time? This happened to me recently when
using cudnn LSTM and GRU Tensorflow layers from C++ for
inference but it can happen in many other cases.

And don’t get me started with the Python image manipulation


garbage software… you often find that you can’t make it run fast
enough for augmentation online and parallelization is not
possible/sucks and so on… so there you have your powerful GPU
waiting for data from crippled software that can’t even use your
CPU cores properly. You may as well take charge, throw the Python
garbage out the window and use opencv and cudnn in c++, which
will be much more fun if you have the time (which you often
won’t), because you’re solving actual problems and not banging
your head against a wall. This has nothing to do with the GPU
choice, I’m just ranting at this point…

Tim Dettmers says


2019-07-02 at 16:16

I have not used Keras in years and I am not sure how to resolve
16-bit problems in Keras. In PyTorch, it is rather easy and works
well. I think this is primarily because NVIDIA is supporting 16-bit
compute with specialized libraries for PyTorch.


I totally agree with the underutilization issue. GTX cards or XX70


cards are more than sufficient for LSTM-like models and if people
use these heavily I would recommend such cards.

Indeed, Python is a nightmare in terms of parallelization. I worked


on this problem myself and was able to do much better than
standard software like TensorFlow/PyTorch for toy problems like
MNIST/CIFAR, where I improved training speed by 4x just because I
had better data loaders. However, I did not have time to
implement all the models/layers from scratch and so I gave up on
the project. Developing such software is just difficult and a huge
undertaking. I think we should be quite grateful for the great free
deep learning software that we have, even though it is inefficient
much of the time.

Michael says
2019-07-02 at 17:44

Re 16 bit: in my experience issues arise when you’re doing some


experiments where you’re not sure how the gradients are going to
behave, or perhaps regularization might push weights/activations
to the range beyond 16 bit range. Basically standard
over/underflows. If you put safeguards in your code to handle
these situations 16 bit works fine in both PT and TF.

Re under-utilization of GPUs, try https://devblogs.nvidia.com/fast-ai-data-preprocessing-with-nvidia-dali
I personally haven’t tried it yet, but it might help.

Daniyal Mujtaba says


2019-06-19 at 13:30

But I am getting a second-hand 1070 for the same price as a 2060 in my
country.. so the 1070 is better according to you?

Reply

Tim Dettmers says



2019-06-29 at 08:22
They are about the same. Getting either one should be fine.

Al says
2019-06-29 at 08:54

You can find a second-hand 1070 for about 200 euros on eBay
now. I think it’s impossible to find an RTX 2060 for that price
anywhere. For the prices I’ve seen, the 1070 is a bit better value
even though it’s a bit slower. And it also has 8GB…

Zarglayoun Amira says


2019-06-01 at 20:30

Hello Tim,
A friend of mine has bought the followings based on a build found on the internet:
CPU : Intel i7–8700K
RAM : 64GB : 16*4: G.Skill Ripjaws V DDR4 3200MHz
GPU : RTX 2080 Ti
Motherboard: MSI Z370 PC PRO
PSU : CoolerMaster Vanguard 1000W PSU
Cooler : ML240L Liquid Cooler with the Hyper 212 LED Turbo.
Storage : A 512GB 970 Pro Samsung M.2 SSD
The total cost was about 3500$

If I have only $1000 to $1500 and my aim is to have a decent build for Kaggle
competitions (I am not looking to be in the top 5, but let’s say around the top 100-
150), how should I change his build? Do I keep the GPU? The RAM? Etc.

PS : I can buy used components to reduce the costs


PS2 : I already have the following (I don’t know if it can be useful)
Crucial CT2C8G3S160BMCEU 16Go Kit (8Gox2)
Samsung SSD 850 EVO, 500 Go – SSD Interne SATA III 2.5″

Reply


Tim Dettmers says


2019-06-13 at 19:54

You do not necessarily need that much RAM, but then you need to write careful code
that is memory efficient. This will already save you a lot. The PSU wattage is
suitable for 2-3 GPUs and can be reduced to 600 watts, which will bring the
price down further. Otherwise, one can go for a cheaper GPU. For one GPU a
cheap Ryzen CPU with a cheap motherboard is more than enough. All of this is
for deep learning though; running models on your CPU would be slow on this
setup, so you would need to make sure that you run boosting/tree models on
your GPU, as sketched below.

PS: Yes the used components look good!
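
As an illustration of the last point, gradient-boosted trees can be trained on the GPU instead of the CPU. A hedged sketch with XGBoost's GPU histogram method and made-up data (parameter names may differ in other boosting libraries):

import numpy as np
import xgboost as xgb

# Synthetic binary classification data, just to make the example runnable.
X = np.random.rand(100_000, 50)
y = (np.random.rand(100_000) > 0.5).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",   # train on the GPU; "hist" would run on the CPU
    "max_depth": 6,
}
booster = xgb.train(params, dtrain, num_boost_round=100)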

Reply

Ariel says
2019-05-30 at 08:17

Hi Tim, I’m reading your post as I’m about to build a deep learning machine.
I’m planning to get me a ASUS Turbo GeForce RTX 2080 8 GB Graphic Card GDDR6
with High-Performance Blower-Style Cooling for Small Chassis and SLI Setups
TURBO-RTX2080-8G ( https://tinyurl.com/y4y68dub ) and add : Patriot Viper 4
Series 16GB Kit (2 X 8GB) 3733 MHz (PC4 29800) DDR4 DRAM Kit (PV416G373C7K)
(https://tinyurl.com/y5k77pva ) and I’m waiting for the New AMD CPU that just
been announced, the Ryzen 9 3900X with 12 cores and 24 threads. would you
recommend?
thanks -Ariel

Reply

Tim Dettmers says


2019-06-13 at 19:49

The AMD CPUs are quite good now but the 3900X has only 16 PCIe lanes so
only good for 2x GPU setups. If you only want two GPUs this is a great choice.
Otherwise, a Threadripper is a cost-effective option for 4 GPU setups.

Reply


Ariel Ravinovich says


2019-06-15 at 10:52

Can you recommend any other AMD chipset that will go with this
configuration?
Thanks

Reply

Tim Dettmers says


2019-06-29 at 08:13

A cheap threadripper usually works well.

Reply

Ariel R. says
2019-07-09 at 02:54

Hi Tim, will the following configuration work?

https://pcpartpicker.com/user/ArielR/saved/#view=pmgrVn

I found a motherboard with 3 slots for video cards, as you can see,
and it will fit the new AMD CPU.
Thanks

Tim Dettmers says


2019-08-04 at 13:41

I would get lower-clocked RAM to save a bit of money, as well as an
air cooler for the CPU (there are some silent, high-performance
ones which cost a bit less). Looks good otherwise!


Ariel Ravinovich says


2019-07-15 at 00:24

Hi Tim, what would you say about this configuration:

https://pcpartpicker.com/list/
I think the motherboard is not bad, with 3 slots at x16.
thanks

Richard says
2019-05-23 at 07:36

Could you discuss the hardware implications of the type and application of deep
learning? Different hardware tradeoffs could be made for a box dedicated to
training an image classifier on a large dataset versus transfer learning with an
existing model and these hardware tradeoffs might be different if the application
was sentiment analysis or NLP. I suppose one way to determine those tradeoffs, as
alluded to in an earlier comment, would be to run the task in the cloud and get an
understanding of the bottlenecks and requirements that way before buying
dedicated hardware.

Reply

Tim Dettmers says


2019-06-13 at 19:42

For image classifiers, it is useful to have a large SSD (1 TB+) where you can put your full
dataset. Other than that, there are no task-specific requirements.

Reply


Karan Sharma says


2019-05-09 at 06:46

Hey,
Is the RTX 2070 good enough to start with if I want to train architectures like YOLO
(object detection) using TensorFlow?

Please help me with this.

Reply

Tim Dettmers says


2019-06-13 at 19:28

Yes, the RTX 2070 will be good enough for this.

Reply

Arthur says
2019-05-07 at 00:40

Hi Tim,

Since you have great experience building this kind of DL machine, I have a
question regarding how to optimize the network bandwidth for different HW
configurations and applications. For example, how much Gigabit or 10 Gigabit
Ethernet do I need if I have 16 or 12 NVIDIA Tesla GPUs with 2 Intel Xeon Scalable
Processors for a graphical analysis or gaming processing application? Thanks.

Reply

Tim Dettmers says


2019-06-13 at 19:27

If you have 4 GPUs per node and you want to train traditional convolutional
networks with standard algorithms, you should get at least 20-40 GBit/s
InfiniBand, preferably 100 GBit/s or faster. 10 GBit/s, and especially plain Ethernet, will be
too slow for standard algorithms. With special algorithms, 10 GBit/s Ethernet
can work, but no open-source project for these algorithms exists, and
implementing them on your own will take months. So it is better to invest in a
good networking solution and use standard libraries.
Reply

Chanhyuk jung says


2019-04-30 at 10:36

Hi. I’m a high school student graduating this year. I completed the deep learning
specialization on Coursera. Now that I’m confident enough to use pytorch for nlp
and RNNs on speech, I need a gpu. I can ask my parents to buy me a computer or
just use google colab. Would it be okay to just use colab even if I can afford a
computer?

Reply

Ade says
2019-04-24 at 11:27

Hello Tim

Please, I am a Ph.D. student and my research area is deep learning. My potential build
has the following specification.

CPU: Intel® Xeon® Silver 4114 10-Core (2.2 GHz, 3.0GHz Turbo, 13.75M L3 Cache)
Motherboard: ASUS® WS C621E SAGE (DDR4 RDIMM, 6Gb/s, CrossFireX/SLI).
RAM: 64GB Kingston DDR4 2666MHz ECC Registered (2 x 32GB)
GPU: 11GB NVIDIA GEFORCE RTX 2080 Ti – HDMI, 3x DP GeForce – RTX VR Ready!
1st Storage: 6TB SEAGATE BARRACUDA PRO 3.5″, 7200 RPM 256MB CACHE
1st SSD Drive(OS installed) : 1TB SAMSUNG 970 EVO PLUS M.2, PCIe NVMe (up to
3500MB/R, 3300MB/W).

Please, is it enough for image processing and training? I have 3 million images to train
on (a 2 TB image dataset). Any suggestions on areas to improve in my build? The
motherboard supports 2 CPUs and up to 756GB RAM as well.

Thanks
Ade

Reply

Tim Dettmers says


2019-04-27 at 09:20

The SSD is really important here for image processing. Try to get multiple SSDs
and combine them in RAID 0, or buy at least one SSD which you use solely for your
dataset. Otherwise, it looks good.

Reply

Nick says
2019-03-29 at 04:11

I feel this guide is becoming obsolete because it ignores alternatives to GPUs, like
the Google TPU. There are conflicting claims, but it does seem clear that chips which
were designed for accelerating deep neural networks are going to be better than
chips that were designed for accelerating graphics. There are at least half a
dozen companies with TPU-like products in the pipeline; it’s not just Google.

Reply

Tim Dettmers says


2019-04-03 at 12:43

Please have a look at my updated GPU recommendation blog post which also
discusses TPUs.

Reply

Damian E says
2019-03-27 at 05:10

Hi,

I need a certain level of mobility so I want to go with a 17″ laptop with eGPU via
Thunderbolt 3.

I would like to know if it makes sense to purchase a laptop that already has an
integrated GPU (mobile RTX 2070-2080). Can they work as a pair? Or do I have to
switch between them and thus make the integrated one useless?
Also, Thunderbolt 3 caps at 40 Gb/s to PCIe, and that is most likely the theoretical
maximum, not necessarily what you get. Does it make sense to go with the Titan
RTX, or am I burning money and should I go with a 2070?
Reply

Tim Dettmers says


2019-03-27 at 20:38

Integrated GPUs are great but also expensive. If you find a cheap laptop with
integrated RTX 2070 I would go for that. If you want to have multiple GPUs
(internal + external) it gets complicated. I am not sure how this setup is
supported. I would look online for other people who tried. In general, a single
eGPU should also be great. It is also cheaper to upgrade the GPU without
upgrading the laptop!

Reply

silverstone says
2019-03-23 at 05:51

Hi Tim,

Thanks for the guide. What do you think about AMD vs. Intel CPU with NVIDIA
GPU? Are there any bottleneck for DL frameworks with AMD CPU?

Reply

Tim Dettmers says


2019-03-24 at 16:15

It seems AMD CPUs are fine. I never had any problems with my AMD CPUs
both at home and in the office. One issue might be if you want to use your
CPU for some linear algebra (solvers and decomposition etc.), but other than
that AMD CPUs are great.

Reply


Hamid says
2019-03-24 at 16:55

I actually ended up purchasing an Intel CPU and an X299 Mark 2 motherboard, but
after I assembled it I realized it’s so lousy that it does not even have basic onboard
graphics, so do I need to buy a cheap graphics card, or can the GPU itself
be used without sacrificing processing power?

Reply

Vinicius Dallacqua says


2019-03-20 at 09:36

Are you aware that this article got cloned? https://medium.com/@joaogabriellima/a-full-hardware-guide-to-deep-learning-cb78b15cc61a

Reply

Tim Dettmers says


2019-03-24 at 16:05

Thank you for pointing that out! That was caught quite quickly and the user is
now banned.

Reply

Kim Goodwin says


2019-03-18 at 02:29

Reading the title, I seriously didn’t think that the article was going to go this
deep into the topic. Great start!!!


I am tired of trying different coolants for my processor and heatsinks. Now I have
decided to use the thermal paste instead. Is that a good option?
Reply

Kartik says
2019-03-16 at 21:53

Hi Tim,
Thanks a lot for all your effort and also keeping this blog up to date.

I was thinking of getting an AMD Threadripper 1900, which would be used for heavy
preprocessing and for running XGBoost or other libraries which run on the CPU. Is it
overkill?

And I've been following Kaggle for a long time while doing other things, and now I
am fully on it. I've seen people doing prototyping and training separately.
Should I get an RTX 2080 Ti or two RTX 2070s for training and prototyping? Or maybe
build a cluster of GTX 1080 Tis?

Reply

Tim Dettmers says


2019-03-24 at 16:00

An RTX 2070 is great for prototyping. Since most of the time on Kaggle is spent
prototyping, it is not so efficient to dedicate resources to training. I would say,
use your RTX 2070 also for training, and if that is not sufficient (memory or
training time too high) use the cloud for access to fast GPUs. This will be cheaper
and more flexible.

Reply

Hamid says
2019-03-16 at 18:04

Hey Tim,
Thanks for this great post,
I’m looking for a GPU to do my own research and I’m thinking of a price range of


$2200-$2500. Given that a while has passed, may I ask what you would choose for 1.
the 2080 GPU (Ti?), 2. motherboard, 3. RAM, 4. hard disk, 5. PSU and 6. case?
I will mainly use this for deep learning.

Thanks
Reply

David Knowles says


2019-03-10 at 14:08

Thanks a lot for the post, very useful.

Do you have advice on how bad an idea it is to have two different GPUs in the
same box? I have an old GTX 980 that it seems a shame to waste so was thinking of
running that alongside a RTX 2070 in a 2 GPU setup. I would potentially use the
980 for prototyping things while the 2070 is off training. Thanks!

Reply

Tim Dettmers says


2019-03-11 at 19:39

That is usually just fine. Make sure that you use software that is precompiled for
different compute architectures (different GPU series) and you should have no
problem.
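
In practice the two cards simply show up as separate CUDA devices, so you can pin a prototyping run to one of them while the other trains. A hedged PyTorch sketch (the device index is an assumption; check the printed names first):

import torch

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))

# Alternatively, set CUDA_VISIBLE_DEVICES before launching to hide one card entirely.
proto_device = torch.device("cuda:1")    # e.g. the older GTX 980, if it is device 1
model = torch.nn.Linear(128, 10).to(proto_device)
x = torch.randn(32, 128, device=proto_device)
print(model(x).shape)                    # runs entirely on the chosen card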

Reply

Farooq says
2019-02-24 at 01:30

Hi Tim,
Extremely helpful article ! Keep it updated please !
I wanted to get your opinion on buying a single GPU, e.g. a GTX 1080 Ti today
priced around $808, vs buying two GPUs, e.g. RTX 2070s (single GPU priced $527,
total = $1045).
Will 2 GPUs (RTX 2070) perform better compared to the single GTX 1080 Ti?


Generally, do two slightly slower GPUs perform better for machine learning
projects compared to a single high-speed GPU?
Reply

Tim Dettmers says


2019-03-11 at 19:27

Usually, two GPUs that are slightly slower are better than one big GPU, because
you can run multiple hyperparameter configurations of the same network, one per
GPU. Parallelization is also an option and is usually slightly faster than one big
GPU. So go for the RTX 2070s!

Reply

rocco says
2019-02-20 at 22:39

Hello Tim,
I am going to use an Asus WS X299 SAGE motherboard with 2x RTX 2080 Ti. If I use dual-
fan GPUs (2x RTX 2080 Ti), is that better compared to blower-style fan GPUs? Actually, I am
afraid of using blower-fan GPUs because of heating problems. Using two GPUs, I think there is
enough space between them.

Reply

Tim Dettmers says


2019-02-22 at 12:14

Yes, if you have space between your GPUs a dual fan will be fine, but probably
comparable to the blower fan.

Reply

Diego says
2019-02-20 at 12:14

Hi Tim
Thank you for your guidance, for those interested in the development of AI.
I have a big question and it is the following:
I want to buy an MSI Vortex G65RV with the following characteristics:
Processor: Intel Core i7-6700K 4.0GHz 8M Cache, up to 4.20 GHz
Hard Disk: 1 TB (SATA) 7200 + 256GB SSD PCIe 3.0
RAM memory: 32GB DDR4 2133MHz expandable
Graphics Card: 8 GB Nvidia GeForce GTX 1070 DDR5 VR Ready
Connectivity: Killer ac Wi-Fi + Bluetooth v4.1, 2 ports Killer Gb LAN
2 x Thunderbolt 3, 2 x Mini Display Port, 2 x USB 3.1 Type-C.

Unfortunately, the desktop only reaches: CPU i7-7700 and GPU 1080.

For which I thought about acquiring an external GPU:


GIGABYTE AORUS Gaming Box RTX 2070
Which tells me the following: GeForce RTX 2070 with 8G memory and 448 GB / s
memory bandwidth has 2304 CUDA Cores and hundreds of Tensor cores operating
in parallel.

I know it will not perform the same as a card connected internally on the board, but I
would like to know how the GPU communicates in those cases, since the connection
is through the Thunderbolt port if I am not mistaken, and I would also like to know
what the DL performance will be with the CPU i7-6700K and GPU RTX 2070 connected
via Thunderbolt.

Thank you.

Reply

Tim Dettmers says


2019-02-20 at 12:47

Thunderbolt 3 is pretty good for communication and you should see only a
small loss in performance (~10%) for most applications. This number might be
higher if you (1) have very large input data, or (2) a small neural network — this is
not very common. So an eGPU should be fine.

Reply

Michel says
2019-02-15 at 08:37


Hello,

I am no expert in deep learning, but the gaming community tends to consider that
going from an RTX 2060 to an RTX 2070 brings little benefit in terms of FPS or detail
rendering, just a higher price.

I am wondering whether there is any reason why the 2060 is not mentioned in your
really great review of GPUs.

Thanks!

Reply

Tim Dettmers says


2019-02-20 at 14:18

This post is just not updated. Need to do this soon.

Reply

Alvaro says
2019-02-24 at 01:40

Also, consider the new GTX 1660 Ti! The “tensor cores” have been removed
but they’ve been replaced with FP16 units. I can only guess about the
actual performance compared to RTX 2060… it would be great to find
some actual tests. Could it be just as cost-effective after the retail price
stabilizes?

Reply

Michael says
2019-03-10 at 07:04

I would be interested in the comparison of the RTX 2060 and RTX 2070 for
deep learning applications.
Do you think it is worth going for the RTX 2070?


Reply

Mario Galindo says


2019-02-14 at 05:01

Thank you for the post.


I am buying a computer with two 1080 Ti GPUs. On the motherboard there are no
display outputs nor space for other GPUs. Where should I connect the
monitors? I want to install 3 monitors. Should I connect them to the GPUs? Or must
I buy another motherboard?
Thank you.

Reply

Tim Dettmers says


2019-02-20 at 14:19

Just connect them to your GPUs. It will barely impact performance of your
GPUs.

Reply

Vishal Bajaj says


2019-02-12 at 09:10

Hi Tim,

Would it be possible to setup a system with a 1080Ti and a 2080 Ti and use them to
perform parallel training?

Reply

Tim Dettmers says


2019-02-20 at 14:20
This does not work, unfortunately. You need the same chip architecture to do
GPU-to-GPU parallelization. You can do GPU-CPU-GPU parallelization, but that
often yields no speedups.

Reply

Vishal Bajaj says


2019-02-22 at 12:40

Thanks for the input Tim. I guess i will try to get the 2080Ti, but i keep
reading many reviews of them dying! So a little afraid to put down 1800$
(CAD)

Reply

Ari H says
2019-02-07 at 02:50

It’s just a shame that AMD’s latest GPUs would have potential to demolish NVIDIA’s
overpriced cards on deep learning if they just fully supported PyTorch. Currently
they’re about as good as doorstops unless you write everything with OpenCL
yourself. Their priority should be to get PyTorch working ASAP.

Reply

Tim Dettmers says


2019-02-20 at 14:25

Agreed. There are some efforts to do this, but it is a delicate issue because
PyTorch's code base builds on an older code base. I hope they can soon
figure out the last issues, and then I would be happy to recommend
AMD cards as well.

Reply


Joshua Marsh says


2019-02-02 at 06:12

Hey Tim, thanks for creating this amazing repository of information!

I just wanted to hear your thoughts on the differences between the RTX 2080 Ti
founders’ edition vs the other RTX 2080 Ti’s with third-party hardware from ASUS,
Gigabyte, MSI, etc (or just the advantages of FE vs not FE cards in general). In my
country, the FE is about $300 USD cheaper so if the others do not have any real
advantages for AI I would prefer to go with the FE. Also, as I am considering water
cooling, any advantages gained from superior cooling may not be a
concern.

Thanks Tim!

Reply

Tim Dettmers says


2019-02-20 at 14:28

If the FE is $300 cheaper definitely go for the FE card!

Reply

Ando says
2019-02-01 at 12:59

Hi Tim,
Thank you very much for the guide.
I am trying to build my first DL machine. Following your advice, I am looking at to
start with an RTX 2070, I will add either another 2070 or an 2080 Ti later, and
maybe even a third one. This is my build, if you have time, please have a brief look:
https://pcpartpicker.com/user/ando_khachatryan/saved/yQkNQ7

My concern and question is about the cards: while looking for a blower-style card
on Amazon, I encountered lots of negative reviews for cards from different vendors,
and the vast majority were describing the same problem: card worked out-of-box,
then, after a week of gaming, artifacts started to appear and the games started to
freeze/crash.


Any comments on this? Have you seen/heard about this?


My problem is that returning the cards would be a major issue for me, I don’t live in
the U.S. and RMA/return would cost me a lot of money.
Thank you in advance.
Reply

Tim Dettmers says


2019-02-20 at 14:30

Looks like a good build with some spare room for more GPUs.

I have heard about the problems. It is unclear still if all RTX have this problem
or if the first batch of RTX cards in the release had this problem. It is worth it to
look at the date of the reviews and see if it got better over time. I personally
have no problems with my RTX cards, but maybe I have been lucky so far.

Reply

Ryan S says
2019-02-01 at 08:34

Hello Tim, I have a few very fundamental questions. I plan on using an NVIDIA Tesla
P4 GPU on a server (let’s say Intel Xeon 16core, 128GB RAM, 2x10GbE etc.,). From a
popular manufacturer that has such a config, it states that the system can handle
video analytics (like face detection) on 9 concurrent video streams @ 720p/15fps.

My question is:
– If I run video @ 720p/3fps, how many concurrent video I may be able to handle
concurrently?
– If I run video @ 1080p/3fps, how many concurrent video I may be able to handle
concurrently?

I know there are many factors related, but just as a ballpark, any suggestion if
lowering the frame rate would help increase the # of video streams. Is this a linear
equation of any kind?

Thanks!

Reply


Tim Dettmers says


2019-02-20 at 14:33

I do not know what this is referring to exactly, but one assumption that could
be reasonable is that it scales linearly; that means 9*5 streams for 720p/3fps
and 9*5/2.25 for 1080p/3fps, but I do not know if that works out. The best is to
ask the manufacturer yourself.
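
The back-of-the-envelope version of that linear-scaling assumption (which, as said above, only the manufacturer can confirm) looks like this:

baseline_streams = 9                             # 9 streams at 720p / 15 fps
fps_factor = 15 / 3                              # 5x fewer frames per second
pixel_factor = (1920 * 1080) / (1280 * 720)      # 2.25x more pixels at 1080p

print(baseline_streams * fps_factor)                  # ~45 streams at 720p / 3 fps
print(baseline_streams * fps_factor / pixel_factor)   # ~20 streams at 1080p / 3 fps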

Reply

Bruce says
2019-01-27 at 14:14

Hello!
Maybe I am a bit confused, but can I have a config with more than one RTX 2070 at
the same time? Because the 2070 doesn’t support SLI
(https://hothardware.com/news/nvidia-geforce-rtx-2070-gpu-will-not-support-nvlink-sli-but-why). Does it matter?

Thanks in advance!

Reply

Tim Dettmers says


2019-02-20 at 14:50

CUDA code cannot use SLI for communication between GPUs. Instead, GPUs
communicate via the PCIe network. Thus no SLI support is needed for
parallelism.

Reply

Nasi says
2019-01-24 at 09:04


Hi Tim,

Thank you for this article.

I already have a GPU, 1080, in my PC. I am going to install GPU 2080ti along with
the previous one.

1) Is it possible to have two different types of GPUs in one PC and use them for
training a neural network, especially in tensorflow? I do not know how to prepare
the environment to use both GPUs. Is it possible for you to send me a good tutorial
link for that?

2) There are two PCI Express slots on the motherboard, but they are too close to
each other. If I install the new one in the empty slot, the fan of the old one will be
blocked by the new one. So, I think I should either buy a new case (with a new
motherboard) or buy a PCI express riser. I found multiple links to buy a PCI riser, but
I do not know whether they are good or not. If I use a PCI riser, I will put the new
GPU outside the case, and I will not close the case. Could you please give me your
opinion about PCI express risers?

https://www.amazon.fr/Cablecc-Gen3-0-16-PCI-Express-x16-Extender-Up-Angled/dp/B07GBRQPQF/ref=sr_1_17?ie=UTF8&qid=1548337647&sr=8-17&keywords=pci+express+riser

https://www.gearbest.com/other-pc-parts/pp_672357.html?wid=1433363&currency=EUR&vip=4450235&gclid=Cj0KCQiA4aXiBRCRARIsAMBZGz_d_R54eWGNs1vpAKV0qBtUDNK9MGw7HzNrLH4d5MFlfCpBMGC9s2IaAm4tEALw_wcB#anchorGoodsReviews

https://azerty.nl/product/delock/670177/riser-card-pci-32-bit-with-flexible-cable-left-insertion-riser-kaart?gclid=Cj0KCQiA4aXiBRCRARIsAMBZGz9LzTqekoEhsr6sRVCwqBNfrWdTBVhDgzYHfu7dNwTBOLBLCfgUn5caAumqEALw_wcB

https://www.amazon.com/Ubit-Multi-interface-Function-Graphics-Extension/dp/B076KN7K5Q

Best regards,
Nasi

Reply

Tim Dettmers says


2019-02-20 at 14:53

1) Yes, but you will not be able to parallelize a deep neural network across
those two different GPUs.

2) If one of them has a blower fan you might be able to put the RTX 2080 Ti
left of the other GPU. Otherwise, you can always buy a riser/extender if you
have overheating issues.
Reply

Ehtesham says
2021-09-11 at 05:01

Well, I tried something on an HP ProLiant ML350 Gen8 for mining (I wish I
could put a picture here): a 6-GPU setup, with 3 mini 009S risers on one side of
the motherboard and 3 on the other side of the board. The six USB wires run
through the back side of the chassis, and the 6 GPUs are mounted on top of the
server.

All 6 GPUs (GTX 1660 Super) are recognized by the Windows 10 system. Now the
difficult part: the ProLiant ML350 Gen8 has separate 400W power supplies
(800W combined). Technically, a 1660 Super requires and draws 90W. In this
scenario, total consumption is 540W, so I decided to add an additional 750W
power supply to meet my requirement.

I failed, because the power supply goes down in 1 minute, Windows
crashes and the system reboots… I tried the same exercise with HiveOS but got the
same result.

Today I will try to put in a 1500W power supply along with the 800W (already
installed) supply and shall see the results… Hope it works!

Reply

Tim Dettmers says


2021-10-24 at 11:45

Thanks for sharing this! This shows how difficult it can be to get the
power requirements right

Reply


Gurunath says
2019-01-21 at 21:23

While you have mentioned that PCIe lanes don’t matter significantly for a <=4 GPU
setup, I plan to use a setup with 8 GPUs (RTX 2080) for NLP and speech recognition
tasks. Would the number of PCIe lanes significantly affect the performance in such
applications? What would be your advice on the number of PCIe lanes for each
GPU in this 8-GPU setup for NLP and speech recognition tasks?

Info: For example, our current NLP task on sequence-to-sequence model for a
batch of 100 sentences, each restricted to 128 tokens (each represented by a 64-bit
tensor) in Pytorch takes around 120-150 ms per iteration on a single GPU(1080Ti).
Thanks in advance.

Reply

Tim Dettmers says


2019-01-23 at 08:37

If you want to parallelize across 8 GPUs the PCIe lanes will matter quite a bit
compared to 4 GPUs. The communication requirements scale linearly with the
number of GPUs (if you use the right communication algorithm). However, if
you run 8 GPUs on a regular 4-GPU motherboard you are also halving the PCIe
speeds and you will have 4 GPUs behind a PCIe root complex. Since only one
GPU behind a PCIe root complex can communicate with another root complex,
it means you need 8x the time to send the same amount of messages between
GPUs compared to 4 GPUs. So in total, the communication with 8 GPUs on a 4-
GPU motherboard will be 32 times more expensive than 4 GPUs on a 4-GPU
motherboard. If you want to parallelize 8 GPUs efficiently, you will need 4 PCIe
root complexes, and this often means 2 CPUs and server-grade hardware (EPYC
systems might be an exception, but I am not sure if those motherboards
support 4x root complex setups).

If you do not want to parallelize a network across all GPUs, you will be fine —
just note that with this system you cannot really do parallel training.

Reply

Eric says
2019-01-18 at 03:42

Hi Tim,

Thanks for this post. After reading it through I am still a bit unsure about the PC specs
that I would like to get to run deep learning, mainly because I don’t want to get
hardware that doesn’t work with other software/hardware.

I wonder if you could recommend me a set of hardware with a budget around


£1800-£2200.

Thanks a lot

Reply

Tim Dettmers says


2019-01-18 at 12:33

I recommend using pcpartpicker.com; it should take care of hardware
compatibility and will alert you if there are some issues with your build.

Reply

Abdelrahman says
2019-01-10 at 14:55

Thanks for this article


I want to buy a GTX 1060 6GB for Master’s research on object detection. But I don’t
have a big budget, so I’m planning to buy a Xeon processor like the E5-1620 with 24
GB RAM.
Is this CPU better for my purpose, or would an i7 be better?

Reply

Tim Dettmers says


2019-01-18 at 12:25

The CPU will be fine for deep learning with a GTX 1060. However, preprocessing
data might take more time with such a CPU.


Reply

krzh says
2019-01-08 at 12:24

Do you think this is a well balanced system? https://pcpartpicker.com/list/kv3N4q

Reply

Tim Dettmers says


2019-01-18 at 12:21

Yes, that looks quite good. The system would also work well with more than 2
GPUs so if you have any plans to use more than 2 GPUs you could get a
motherboard with more PCIe slots. Otherwise, all good!

Reply

Nitin Gupta says


2019-01-08 at 06:34

Hello Everybody,

I have been trying to get multiple GPUs to work on an Ubuntu system.

I am using Ubuntu 16.04 LTS; this is the nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:02:00.0 Off |                  N/A |
|  6%   57C    P0    26W / 120W |    321MiB /  6072MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 106...  Off  | 00000000:04:00.0 Off |                  N/A |
|  0%   27C    P8     5W / 120W |      2MiB /  6072MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 106...  Off  | 00000000:05:00.0 Off |                  N/A |
|  0%   26C    P8     5W / 120W |      2MiB /  6072MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1165    G   /usr/lib/xorg/Xorg                             198MiB |
|    0      2066    G   compiz                                         119MiB |
|    0      2746    G   /usr/lib/firefox/firefox                         1MiB |
+-----------------------------------------------------------------------------+

Although the system shows that it has all the cards, they don't get used even
when I try Keras for multi-GPU learning.

Motherboard: ASUS Prime Z270-P
Processor: Intel i7 7700 LGA 1151
GPU: Zotac GTX 1060 (3 of them)

If any other information is required to solve this, I can provide it. I am
using PCIe riser bridges to mount the GPUs and use them.
Reply

Tim Dettmers says


2019-01-18 at 12:19

This should usually work. I guess the problem might be the PCIe bridge. It is
difficult to tell with this information, and it is not straightforward to debug. If you
can, use two GPUs without the PCIe bridge and try again.

Reply

Nitin says
2019-01-18 at 20:36

I have used two GPUs without the PCIe bridge; these 2 GPUs are now mounted
directly on the motherboard, but I am still not able to use both of them.
TensorFlow starts to use memory from both but does not use the second
one for processing.


Reply

Tim Dettmers says


2019-01-23 at 08:21

Did you write code that utilizes both GPUs? You can try to run some
code which tests parallelism. There are some multi-GPU samples from
NVIDIA (CUDA samples) which test if parallelism between your GPUs is
possible. If the sample works, it is a software issue.
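
Since the setup above uses Keras/TensorFlow, a quick software-side check (a hedged sketch, not part of the original reply; it assumes a TensorFlow 2.x install) is to list the visible GPUs and fit a tiny model under MirroredStrategy, which should report three replicas and place work on every card:

import numpy as np
import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))      # should list all three GTX 1060s

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(100,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

x = np.random.rand(4096, 100).astype("float32")
y = np.random.rand(4096, 1).astype("float32")
model.fit(x, y, batch_size=512, epochs=1)          # watch nvidia-smi while this runs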

Reply

geek12138 says
2019-01-07 at 08:45

Hi Tim,

Which is more suitable, a Titan RTX or two 2080 Tis, considering the memory
difference between 24 GB and 2 x 11 GB?

Reply

Tim Dettmers says


2019-01-07 at 19:49

The computing power of two RTX 2080 Tis is almost double that of the Titan
RTX. Thus it is a question of compute vs memory. If you want faster compute, go
with 2x RTX 2080 Ti; if you want more memory, go with the Titan RTX.

Reply

Neil M says

2019-01-03 at 10:53
Hi Tim,

Happy New Year and thanks for a great blog!

I’m building my first deep learning work station and based on your guide I’ve just
purchased an RTX 2070.
I’m re-purposing an existing older workstation as the basis for the build. The
specification of the machine is:

CPU – Intel i7 3820


Motherboard – MSI X79A-GD65 (8D)
RAM – 16Gb DDR3 (2400 MHz)
Boot Drive – 240Gb Kingston HyperX Fury
Data Drive – 4TB Western Digital Red
OS – Ubuntu (64 bit)18.04

I want to use the existing GeForce 660 GPU to drive the monitors and keep the RTX
2070 solely for computation. Looking at the NVIDIA website, both GPUs use a
common driver, so I expect this will work. Do you foresee any issues or limitations with
this approach or my current spec? Thanks.

Reply

Tim Dettmers says


2019-01-07 at 19:41

I think the system should work quite well with an RTX 2070. Some computer
parts are older thus some parts of common code, like preprocessing, would be
slower, but your deep learning performance should be close to what other
people report with modern desktops.

Reply

Peixiang says
2018-12-27 at 16:36

Can I use two different GPUs at the same time? Say a 1080 Ti and a 2070? What issues
might I encounter?


Reply

Shayan says
2018-12-23 at 16:00

Hi Tim,

Can you please comment on what type of setup is used in this video
[https://www.youtube.com/watch?v=RFaFmkCEGEs&t=54s]? At 0:47 you
can see he has 4 NVIDIA GPUs via the nvidia-smi command, yet he is
using macOS.

Also, would you recommend using macOS (w/ GPUs) for competitions?

Kind Regards
Shayan

Reply

Tim Dettmers says


2018-12-27 at 17:36

These are GTX 1080 Ti GPUs. There are some compatibility issues with macOS,
in that only certain NVIDIA GPUs are supported, but I do not know the details.
For this, I usually do a Google search on Reddit: "site:reddit.com which NVIDIA
GPUs work for macOS deep learning"

Reply

Fiz says
2018-12-21 at 16:27

Hi Tim, questions on your comments “For good cost/performance, I generally


recommend an RTX 2070 or an RTX 2080 Ti. If you use these cards you should use
16-bit models. Otherwise, GTX 1070, GTX 1080, GTX 1070 Ti, and GTX 1080 Ti from
eBay are fair choices and you can use these GPUs with 32-bit (but not 16-bit).”


Do you mean:
(1) on RTX cards, running 16-bit models indirectly doubles the available memory for
deep learning compared to 32-bit models, is that correct?
(2) the facts in (1) are not valid for GTX cards, i.e. 32-bit models and 16-bit models
make no difference?
(3) how do you explicitly run in 32-bit vs 16-bit mode when deep learning? Are there
examples?
Reply

Tim Dettmers says


2018-12-27 at 17:33

1. It is not a straight doubling, but the memory requirements are much lower.
2. You can have 16-bit models with GTX cards, but what happens under the
hood is that all values will be cast to 32-bit before any computation. So the
weights are 16-bit and the computation 32-bit for GTX cards. However, you
should also see a good reduction in memory if you use 16-bit weights with GTX
cards.
3. In PyTorch it can be as simple as “model = model.half()” and you will run in
16-bit mode; see the sketch below. In practice, it can be a bit more complicated depending on the
model that you are running. You can have a look at NVIDIA’s 16-bit
library Apex, which is built on PyTorch, for more sophisticated examples.
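
A minimal sketch of point 3 (the model and sizes are made up): weights and inputs are cast to fp16 with .half(); on GTX cards the values are still stored in 16-bit, which saves memory, while the arithmetic runs in 32-bit as described in point 2.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).cuda().half()
x = torch.randn(32, 512, device="cuda").half()
print(model(x).dtype)    # torch.float16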

Reply

Phil says
2018-12-20 at 05:48

Hi Tim,

Can you please tell me if I am doing something wrong. The idea is to run LSTMs
with many datasets that are rather small (<5 GB). I will have several GPUs, not to
parallelise them but to run different optimisations at the same time. I am not a hardware
expert and I want to make sure that I don't waste GPU power because of a poor
setup. If it goes well, I will replicate it to populate a rack.

-Gigabyte X399 Designare Ex


– AMD Threadripper 1920X
– 4x KFA2 RTX 2070 OC
– 4x HyperX Fury 16Go, DDR4-2400, DIMM 288
– Samsung 860 EVO Basic (500Go)


– WD 4To
– Corsair AX1600i (1600W)

I still need to find a 3U or 4U rackmount that fits.

I really appreciate your help

Phil
Reply

Tim Dettmers says


2018-12-27 at 17:20

The RTX 2070 cards that you chose might be prone to overheat in that
configuration. I would pick a blower-style RTX 2070 card instead. Otherwise a
good build. I am not sure though if you can easily find 3U or 4U racks that fit
well with this configuration.

Reply

Phil says
2019-01-01 at 07:02

Thank you very much!

Happy New Year to you.

Reply

Saurabh K says
2018-12-19 at 19:16

Just want to know your thoughts on the combo I am looking at:

1. MSI X299 Gaming PRO Carbon AC: https://www.amazon.com/gp/product/B071G3JR9Y/ref=crt_ewc_title_dp_3?ie=UTF8&psc=1&smid=ATVPDKIKX0DER


2. Intel Core i7-7800X X-Series Processor (28 PCIe lanes for possible future
expansion): https://www.amazon.com/gp/product/B071H1B3Z1/ref=crt_ewc_title_dp_1?ie=UTF8&psc=1&smid=ATVPDKIKX0DER

The motherboard seems pretty expensive; any other suggestions for a
motherboard compatible with the i7-7800X? The reason I am going for that one is
that it supports wireless LAN. Otherwise, the MSI X299 RAIDER LGA is a pretty good
option (https://www.newegg.com/Product/Product.aspx?item=N82E16813144059).
Any thoughts?
Reply

Tim Dettmers says


2018-12-27 at 17:17

Other motherboards that do not support WLAN are fine as long as you get a
USB wifi adapter. This combination might save you a bit of money on the
motherboard. The i7 is a very versatile CPU — a bit expensive but it will show
strong performance in any case!

Reply

Ahmed Adly says


2018-12-18 at 01:51

Hi Tim,
Can I add an RTX 2080ti to my existing 2 GTX 1080ti to improve the training time
for voice recognition application?

Reply

Matt says
2018-12-17 at 12:01

I'm a beginner who had some perishable credit at an electronics vendor with a limited
computer parts supply. I got my hands on an EVGA RTX 2070 XC 8GB and a full ATX case.
I started the fast.ai course, but the workflow with cloud resources got too annoying.


So I have my hands on the RTX 2070 already. I don't want to waste too much of its
capability for most beginner/intermediate use cases, but I do not have that much to
spend.

The impression I get from the latest guide update is that even an i3 7100 or G4560
would not hamper the GPU, or only slightly (and those CPUs are really cheap). Have I
understood it correctly?
Reply

Tim Dettmers says


2018-12-27 at 16:31

Yes, a single RTX 2070 should be easy to utilize with an i3 7100. However, if you
preprocess a lot of data you might still run into some bottlenecks. If you make
sure that you have good-quality preprocessing code, you should be fine.
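
What "good quality preprocessing code" mostly means in practice is pushing image decoding and augmentation into background worker processes so the single GPU is never starved. A hedged PyTorch/torchvision sketch (the data path is hypothetical):

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("data/train", transform=transform)   # hypothetical path

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=2,      # on a 2-core i3, 2 workers is a reasonable starting point
    pin_memory=True,    # speeds up host-to-GPU copies
)

for images, labels in loader:
    images = images.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    # ... forward/backward pass would go here ...
    break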

Reply

Nazim says
2018-12-17 at 06:12

What are your thoughts on Titan RTX?

Reply

Tim Dettmers says


2018-12-27 at 16:29

Too expensive for the given performance.

Reply

Nazim says
2018-12-28 at 05:19

So two 2080 ti better than One Titan RTX?


Reply

Abir Das says


2018-12-17 at 05:12

Hi,

Thanks for the detailed post and also the enlightening discussions. I have prepared
two builds which I am sharing below. Can you please provide your suggestions. I
have some specific questions regarding the builds which I am providing with the
build configs. But, before that let me provide my requirements/type of ML works I
intend to do.

1. I am a very new assistant professor at an Institute in India. I will eventually get the
money to procure 2 workstations with 4 GPUs (I am at least hoping 1080Ti) in each.
But, that will take time and I want a decent build to get the ball rolling. For that I
have already bought 2 1080Ti GPUs and a Samsung 860 EVO 500 gb around 3
months back when I was in the US. So, they are sitting idle now. To avoid this, and
to get started, I want to buy the other parts of a DL machine from my pocket. My
budget is around Rs. 100,000 [Rs is the Indian currency].
2. The machine will be in the server room of the institute. So, the cheapest cooler
[whatever noise level] and cabinet is what I would prefer.
3. My student [only one at this moment] will run RL codes [both training and
inference] on images. Later, I might do some classification work on videos [but this
is a distant possibility at this moment, and I might be able to procure the servers
with 4 GPUs by then].
4. I don’t plan to expand this machine beyond 2 GPUs. My long term plan is to
make this a student machine that will have even 1 GPU and the student can
develop/prototype codes here while the stable code would run in the 4 GPU
servers.
5. My builds include some prices that do not have web links. These are from https://mdcomputers.in — a local but reputable vendor. I could not find how to link their product pages to PCPartPicker; otherwise, I would have done that. So, you have to take my word on the prices.

My first build uses a Core i5 9600K and a compatible motherboard. As my budget allows me to spend some more money, should I go for an i7 8700K? This change, along with a change of motherboard, costs me Rs. 6,000 more [still remaining inside the budget]. The Core i7 processor supports hyperthreading, and my student says that, from time to time, he might use multiple threads for preprocessing in Tensorflow. Or, instead of upgrading the processor, should I just go for a RAM upgrade to, say, a total of 48 GB [and stick to the Core i5 9600K]? This will cost me around Rs. 12,000 more [and so I am remaining within the budget]. And later, I can pull out one memory stick when the system ultimately becomes a student's development machine with one GPU.
Build with core i5 9600K — https://in.pcpartpicker.com/user/dasabir/saved/fgQ299
Build with core i7 8700K — https://in.pcpartpicker.com/user/dasabir/saved/bD7J8d
While looking for the option of a Core i7 8700K, I came across the Core i7+ 8700 [https://mdcomputers.in/intel-core-i7-8700-bo80684i78700.html]. I see that this will cost me Rs. 11,000 more over my Core i5 9600K build. I am not sure what the difference is between an i7 8700K and an i7+ 8700 (other than the frequency/speed). Here is the comparison link — https://ark.intel.com/compare/126684,140642 . Will the i7+ 8700 require a different motherboard? It says the box includes NVME 3.0 x 2; does that help me? Also, the i7+ processor includes 16 GB of Optane memory. Will it be of any help (e.g., keeping the OS there)? Also, does Optane memory occupy PCIe lanes? Any suggestion on this would be great to have.

My second build is with AMD processors. I tried the AMD Ryzen 7 2700X. The price comes out around the same as the Core i5 9600K build. It does have 8 cores compared to 6 cores for the Intel processors, but does AMD have hyperthreading? I am not sure. Also, it does not have MKL; is Intel MKL going to be crucial for deep learning?
Build with AMD Ryzen 7 2700X
— https://in.pcpartpicker.com/user/dasabir/saved/3ddTBm
Though you say the number of PCIe lanes is not that important, especially with 2 GPUs, I just tried my luck with an AMD Threadripper processor. As expected, it is over budget. But if you say it is worth spending this much money, I might also go for it.
Build with AMD Threadripper 1900X
— https://in.pcpartpicker.com/user/dasabir/saved/73mhyc

Abir
Reply

Tim Dettmers says


2018-12-27 at 16:26

Either build is fine. You could buy a bit cheaper RAM with lower speeds — it will not make a big difference. If the AMD build is too expensive and you run only 2 GPUs, I would rather go with an i5 or i7 build.

Reply

Abir Das says


2018-12-28 at 00:32

Hi Tim,
Thanks a lot. The AMD Ryzen 7 2700X build is the cheapest, so I will go with it. I tried looking at lower-speed RAM, but that is not saving me much.

So, I will go with the AMD Ryzen 7 2700X build, i.e., https://in.pcpartpicker.com/user/dasabir/saved/3ddTBm
Reply

Alam Noor says


2018-12-16 at 19:24

Thanks for sharing

Reply

Zhenlan says
2018-12-16 at 17:19

Hi Tim, thanks for the great guide. I found your blog super helpful when I built my first box in 2016, and it is still worthwhile to come back for updates.

One question about GPU RAM. I am getting serious about CV competitions on Kaggle, and that entails using big pre-trained networks like Xception. I have a 1080 Ti and use a batch size of 16 for input images of size 256×256. On the software side, I am using tensorflow.keras (mostly for multiprocessing augmentation) and raw TensorFlow.

My computer freezes every now and then, so I have to force a restart. And even before it freezes, it gets slower by the epoch: the first epoch would take 90s, the next 120s, and it just gets worse. It also gets slower and more likely to freeze as I experiment with more network structures by defining more models (clear_session() or tf.reset_default_graph() does not help).

I use “top” to monitor CPU/RAM, and neither seems to be the problem. I use something like “watch -n0.1 nvidia-smi” to monitor the GPU. GPU utilization stays above 90%, but that does not really tell me much about memory, as TensorFlow automatically allocates almost all of the GPU memory at the start. I tried tf.ConfigProto() to limit the GPU memory used by TensorFlow, without much luck.

Do you have any suggestion as to how to diagnose this issue? Thanks in advance
and happy holidays!!

Best,


Zhenlan,
Reply

Tim Dettmers says


2018-12-27 at 16:29

It sounds like you have a memory leak somewhere in your code. First, check whether you run out of CPU RAM and your computer is swapping RAM to disk. If that is not the issue, try to debug TensorFlow further. Installing the newest NVIDIA drivers can also help. If none of that works, try PyTorch and see if that works for you (PyTorch is much easier to debug in these cases). Good luck!
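As a side note for readers hitting the same issue: in TF 1.x (the version in use here), the usual way to keep TensorFlow from grabbing all GPU memory at start is via tf.ConfigProto. A minimal sketch, with the commented-out memory fraction as an arbitrary example value:

import tensorflow as tf

gpu_options = tf.GPUOptions(
    allow_growth=True  # allocate GPU memory lazily instead of grabbing it all at start
    # per_process_gpu_memory_fraction=0.8,  # alternative: hard cap at ~80% of the card
)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    pass  # build and run the graph as usual; tf.keras users can hand this session
          # to tf.keras.backend.set_session(sess)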

Reply

david says
2018-12-16 at 13:36

Hi, I am a computer scientist but I have not done any project on DL before. Maybe later I will buy an RTX Titan, but not in the next three months. Could you please let me know the following?

1. Given a model, if I want to see how it behaves under different initial parameters, will there be a problem if my desktop has two GPUs of different kinds (e.g. one GTX 1060 and one RTX 2080/2080 Ti or RTX Titan)?

2. Am I correct that only when I do parallel training of the same network with the same set of initial parameters will I need GPUs of the same model?

3. Those 20×0 cards have Tensor Cores in addition to CUDA cores. Are Tensor Cores helpful in speeding up training in TensorFlow? Why else is it good to buy an RTX card now rather than a GTX card?

Reply

Tim Dettmers says


2018-12-27 at 16:35

1. You can use different GPUs for different networks. However, if you want to parallelize a single network across both GPUs, they need to have the same chip, for example 2x RTX 2080 or 2x GTX 1060.

2. Yes, this is correct.
3. You will benefit from Tensor Cores in TensorFlow, but as far as I understand
not all features of RTX cards are implemented yet and your Tensor Core code
will be fast, but will not run at 100% speed. I think PyTorch has much better
support for RTX cards, but it is just a matter of time until full features are
implemented in TensorFlow.
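To illustrate point 1, here is a rough TF 1.x sketch (not from the original reply) of pinning two independent networks to two different cards with tf.device; allow_soft_placement lets the same script still run on a machine with fewer GPUs:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 128])

with tf.device('/gpu:0'):   # e.g. the older card
    net_a = tf.layers.dense(x, 64, activation=tf.nn.relu, name='net_a')

with tf.device('/gpu:1'):   # e.g. the newer card
    net_b = tf.layers.dense(x, 64, activation=tf.nn.relu, name='net_b')

# soft placement falls back to available devices if a GPU is missing
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())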
Reply

George M says
2018-12-16 at 11:43

Tim, thanks for updating this. Long term I am hoping to build a dual RTX 2070
system to allow for data parallelism. Would hooking up one monitor to each GPU
be a viable option? Also, in that case would the “coolbits” option be able to control
each GPU fan, or will fan control still be “hard and hacky” as you put it?

Reply

Brendan says
2018-12-14 at 07:23

Hey Tim,

First I want to thank you for this blog, it teaches a man to fish rather than giving
him a fish as the old aphorism goes.

I have a few questions related to hardware that I’m a little unclear on, and also that
are pertinent as PCIe 4.0 slots are rumored soon. First a little background on my
build, I’m going to be building a computer primarily for statistical computing before
I begin a doctorate program in stats/applied math. This means it will first need to
be good at serial processing which is why I’m entertaining CPUs that are overkill in
terms of CNN needs (the type of neural networks I will be using). I do want it to be
able to do CNN work as I am intrigued by and play around with that somewhat.

1) I’m looking between an i7 8700K and an i9 9900K. I was entertaining an AMD


ryzen 2700X but I realized that I shouldn’t as that has a max number of lanes of 20,
correct?

2) Should I wait on my build until PCIe 4.0 motherboards are released? All I see now are rumors, but they are rumored to be released next year, in 2019. Given that DMA is the main bottleneck here, wouldn't it be beneficial to wait until PCIe 4.0 is available (31.508 GB/s more than doubles the performance of PCIe 3.0)?

3) If that is true, then would RAM memory speed actually be a factor? I don’t think
so, as you stated your current setup gets over 50 GB/s for the RAM which would
still be above the DMA bottleneck of PCIe 4.0

4) Even if PCIe 4.0 motherboards are released in 2019, they would still be
compatible with the processors I mentioned above, correct? If so, then building my
rig now wouldn’t hamper me as I could just upgrade the motherboard once PCIe
4.0 compatible motherboards are available. Is that right or are we unsure if there
will be LGA1151 compatible PCIe 4.0 motherboards?

5) I’m looking at a GTX 1080 ti and a RTX 2080 ti for my GPU. I think the RTX 2080 ti
is a little outside my price range, but I would be debating between a GTX 1080 ti
with a water cooling block setup or an RTX 2080 ti without water cooling. Which do
you think will likely perform better as the temperatures will likely hamper
performance of the GPU with the stock fans?

Thank you again for this post and for your continued answering of questions in the
comments. If you have time, I would greatly appreciate a response!

Cheers,
Brendan
Reply

Tim Dettmers says


2018-12-18 at 11:47

Hi Brendan,

1) The Ryzen 2700X would be fine for up to two GPUs. If you want more GPUs I
would look for a different CPU.
2) PCIe 4.0 will not help much with deep learning and I would not wait for it.
3) Memory speed is not much of a factor. I would just buy cheap RAM.
4) This is determined by the motherboard. For PCIe 2.0 -> PCIe 3.0 we saw that
the new motherboards often supported only the most recent CPU sockets. I
believe this could be the case for the new PCIe 4.0 boards too.
5) A single RTX 2080 Ti on air should be fine. You should see a slight
performance decrease but it is still faster than the GTX 1080 Ti. If you do not
have the money for an RTX 2080 Ti, both a water- or air-cooled GTX 1080 Ti
should be a great option.

Reply


Haider Alwasiti says


2018-11-16 at 01:36

Your statement in 2015, is it still true with current frameworks? I used pytorch with
fastai and all threads and cores are maxed out usually ( image training in resnet34) :

Needed number of CPU cores


When I train deep neural nets with three different libraries I always see that one
CPU thread is at 100% (and sometimes another thread will fluctuate between 0 and
100% for some time). And this immediately tells you that most deep learning
libraries – and in fact most software applications in general – just use a single
thread. This means that multi-core CPUs are rather useless

Reply

Tim Dettmers says


2018-11-18 at 11:46

I think it is partially true. Preprocessing still dominates CPU needs. If you do preprocessing while training, you need more cores; however, you could also preprocess your data in bulk before training.

Some frameworks, like TensorFlow, also use quite a bit of CPU in the background. I do not have the deepest insights into this, but TensorFlow's graph pipeline is quite sophisticated and might need more CPU cores to run efficiently. For PyTorch, the benefit would mainly lie with background loader threads.
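As a concrete illustration of those background loader workers, here is a minimal PyTorch sketch with random stand-in data (not code from the post): the worker processes do the CPU-side preprocessing while the main thread trains, which is where extra CPU cores actually pay off.

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # random stand-in data; a real Dataset would decode/augment images here
    dataset = TensorDataset(torch.randn(1024, 3, 64, 64),
                            torch.randint(0, 10, (1024,)))

    loader = DataLoader(dataset,
                        batch_size=32,
                        shuffle=True,
                        num_workers=4,    # CPU cores are spent here, not in the training loop
                        pin_memory=True)  # enables faster, asynchronous host-to-GPU copies

    for images, labels in loader:
        pass  # forward/backward pass would go here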

Reply

Eslam Haroun says


2018-11-04 at 14:08

Hi Tim,
Sorry for missing this point.
Is a motherboard with dual x16 PCIe 3.0 slots, paired with a CPU that has only 16 PCIe lanes, equivalent to a board with a single x16 PCIe 3.0 slot, or to one running dual x8/x8? Is it equivalent to dual x16 PCIe 2.0?
Thanks

Reply

Tim Dettmers says


2018-11-18 at 11:50

16x slots and CPU lanes are different things. If you have 2 PCIe 16x slots and a CPU with 16 lanes, that would be perfect for a 2 GPU setup!

Reply

Eslam Haroun says


2018-11-04 at 10:43

Hi Tim,
Thank you for your great guide.

Are the following components sufficient for a 2-GPU system running at full power?

Motherboard: ASRock Z270 PRO4 LGA1151/ Intel Z270/ DDR4/ Quad CrossFireX/
SATA3&USB3.0/ M.2/ A&GbE/ ATX Motherboard.
Power Connectors 24-pin main power connector, 8-pin ATX12V connector

CPU: Intel BX80677G4620 7th Gen Pentium Desktop Processors


GPU: EVGA GeForce GTX 1060 GAMING, ACX 2.0 (Single Fan), 6GB GDDR5, DX12
OSD Support (PXOC) 06G-P4-6161-KR

RAM: Corsair CMZ8GX3M1A1600C10B Vengeance Blue 8 GB DDR3 1600MHz (PC3 12800) Desktop Memory 1.5V

HDD: WD Blue 1TB SATA 6 Gb/s 7200 RPM 64MB Cache 3.5 Inch Desktop Hard
Drive (WD10EZEX)

PSU: should it be 24-pin main power connector, 8-pin ATX12V connector?

like:
Antec EarthWatts Gold Pro 550W Power Supply 550 Watt 80 Plus Gold PSU with
120mm Silent Cooling Fan, Semi Modular, 7 Years Warranty, 99% +12V and ATX12V
2.4 – EA550G PRO Black

or 20+4 like:
Antec EarthWatts Gold Pro 550W Power Supply 550 Watt 80 Plus Gold PSU with 120mm Silent Cooling Fan, Semi Modular, 7 Years Warranty, 99% +12V and ATX12V 2.4 – EA550G PRO Black

Thank you.
Reply

Tim Dettmers says


2018-11-18 at 11:49

Sorry I am not able to look at a build in full detail. If you can narrow down your
problem to a single question I might have time to answer.

Reply

Parag Maha says


2018-10-23 at 23:13

Thank you for sharing this article. You explain everything very well, in the article as well as in the comments. It is very helpful for me, as I am preparing for hardware courses, so I often search for these things, and I found that your blog is simply awesome among all. Thank you once again. Waiting for your new article. All the very best. KEEP WRITING

Reply

Mohammed Sarraj says


2018-10-21 at 07:35

Hello Tim,
I am adding these questions to the list of questions mentioned above ( just a
reminder):

+ I am going with a 2x 2080 Ti setup for now, and I am going to be expanding in the future to 4x 2080 Ti. However, would I benefit from using NVLink? If so, how is it going to impact things? For example, would I be able to double my memory? Would it affect other bottlenecks?


+ Since I am going to be using 4x 2080 Tis and probably a 9920X (depending on your answer to the questions from the previous post), I am expecting to use about 1800 Watts. However, I couldn't find any PSUs with that capacity. I am looking for a PSU with low voltage ripple (for overclocking purposes), but I wasn't able to find even a single 1800 Watt PSU on the market. What is your advice in this case? I've heard of the Add2PSU component ( https://www.youtube.com/watch?v=erHoq3DbwVA ). However, I am not sure how safe it is to use multiple PSUs in a single PC. What do you think?

Thanks
Reply

Tim Dettmers says


2018-10-23 at 09:14

(1) NVLink on RTX cards is currently limited to two GPUs only and it will not help
you too much in that scenario for data parallelism. For model parallelism, it
could help, but currently, there are no models and code that profit from that.
So it might be useful in the future, but not right now.

(2) 1800 watts is a bit much. You need about 275 W per GPU and another 300 W for the CPU, which is about 1400 W total. If you get a 1300 to 1600 W PSU you should be fine, I think, even if you overclock. There are some nice ones from EVGA; you can find them easily if you search Newegg.
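The arithmetic behind that estimate, written out (the 10% headroom factor is only an illustrative assumption):

num_gpus = 4
watts_per_gpu = 275
cpu_and_rest = 300

nominal = num_gpus * watts_per_gpu + cpu_and_rest
print(nominal)                  # 1400 W nominal draw
print(round(nominal * 1.10))    # 1540 W with ~10% headroom, inside the 1300-1600 W range above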

Reply

Mohammed Sarraj says


2018-10-23 at 09:59

Thank you so much! One more thing, can you please help me with the
questions from the previous post? I will post the questions here:


Just a quick clarification on your reply earlier, I am planning on expanding
later to 4 x 2080 TI. So, in this case I would go with Rampage VI Extreme &
(9920x or 9940x depending on price) and 2x2080ti. However, the issue is
that I am concerned about the number of lanes going into GPUs. Both of
these CPUs has 44 lanes, which is not enough to run 16 lanes on each GPU.
Does it matter (16 vs 8 lanes)?

Another option is to go with threadripper which has 60 lanes (again not


enough for 4×16), but at least I might be able to run 3×16 + 1×8.


Also, you addressed this before, but I just want to confirm, CPU clock is
irrelevant. Basically, I am not losing anything by going down from 9900k @
5.3 GHz to 9920x @ 4.7 GHz to even a threadripper 2950x @ 4.4GHz.
right?

Also, is there a difference between using an intel cpu vs amd? (Sorry if this
seems like a very broad question, but I’m not sure which way to go; intel
has higher clocks and worst value-for-money, while amd has better value
and more lanes)

Thank you again for being patient with me! Choosing the build
components has been a steep learning curve for me. I am really glad that I
found someone to point me to the right direction.”

Thanks Tim!
Reply

Mohammed says
2018-10-13 at 13:38

Thanks for your prompt response!

Just a quick clarification on your reply earlier, I am planning on expanding later to 4


x 2080 TI. So, in this case I would go with Rampage VI Extreme & (9920x or 9940x
depending on price) and 2x2080ti. However, the issue is that I am concerned about
the number of lanes going into GPUs. Both of these CPUs has 44 lanes, which is not
enough to run 16 lanes on each GPU. Does it matter (16 vs 8 lanes)?

Another option is to go with threadripper which has 60 lanes (again not enough for
4×16), but at least I might be able to run 3×16 + 1×8.

Also, you addressed this before, but I just want to confirm, CPU clock is irrelevant.
Basically, I am not losing anything by going down from 9900k @ 5.3 GHz to 9920x
@ 4.7 GHz to even a threadripper 2950x @ 4.4GHz. right?

Also, is there a difference between using an intel cpu vs amd? (Sorry if this seems
like a very broad question, but I’m not sure which way to go; intel has higher clocks
and worst value-for-money, while amd has better value and more lanes)

Thank you again for being patient with me! Choosing the build components has
been a steep learning curve for me. I am really glad that I found someone to point
me to the right direction.


Reply

Mohammed says
2018-10-10 at 18:26

Thank you, Tim, for such a great guide! I have a question about the asynchronous mini-batch allocation code you mentioned. I am using Python, mainly through Keras and sometimes TensorFlow. How can I do the asynchronous allocation? Also, I am not familiar at all with CUDA code; how hard is it to learn? And is there a way to integrate CUDA code into my normal use (Python and Keras)?

Reply

Tim Dettmers says


2018-10-10 at 19:52

This blog post is a bit outdated. It seems that TensorFlow uses pinned host memory by default, which means that you are already able to do asynchronous GPU transfers. While I stressed it in the blog post, it's actually not that big of a bottleneck for most cases. For large video data, it could have a good impact.
Reply

Mohammed says
2018-10-11 at 03:19

So here is the build I am considering. I should say that my main goal is to learn deep learning as quickly as possible. I am planning on doing as many Kaggle competitions (open and closed) as possible. Can you please help?

GPU: 1 x GTX 1080 Ti + 1 x RTX 2080 Ti (I might add a third card depending
on your recommendation)
Cooling: Liquid (open loop)
PSU: RM1000x (I might go a bit higher in terms of OC quality (higher tier) and power delivery, targeting 60-80% at full load for best efficiency, where the expected maximum load is ~800W). I am considering getting an AT1200x instead.

As for CPU and RAM, there are 2 options:


Option A: (Higher CPU clock but Dual Channel Memory)
CPU: 9900K with 16MB smart cache (not sure, but estimated @ ~5.2-5.5GHz for all 8 cores)


Motherboard: Asus Maximus XI Extreme
RAM: 64 GB (OC to 4000MHz+)

Option B : (Quad Memory Channel but Lower CPU Clock and I think
overkill core count)
CPU: 7920x with 16.5 MB cache (not sure but estimated @~4.7GHz for all
12 cores)
Motherboard: Rampage VI Extreme
RAM: 64 or 128 GB (depending on your recommendation) again (OC to
4000MHz+)

I really like option A because of the CPU clock speed. However, if I go with option A, then I will be using a cloud service for large datasets. The advantages of option B are the option to expand memory to 128 GB and the quad-channel memory. Also, I am not sure about the difference between the two motherboards in terms of multi-GPU PCIe lane allocation (I don't understand the specs).

So, do you have any comments on the builds?
Which one would you pick?
Where are the bottlenecks here?

Thanks
Reply

Mohammed says
2018-10-12 at 05:37

Also, how much RAM do I need for Kaggle competitions?

Reply

Tim Dettmers says


2018-10-12 at 09:43

For more than 2 GPUs choose the Rampage VI motherboard; if you settle at 2 GPUs, then the Maximus XI. I would also go for two GPUs with the same chipset, either GTX 1080 Ti or RTX 2080 Ti, so you will be able to parallelize across GPUs. I would get RAM with a lower clock — there is almost no gain from OC RAM and it is quite expensive.

Reply


Mohammed says
2018-10-12 at 09:52

So, your point from the article about (RAM clock) is not outdated.
In other words, is RAM clock irrelevant because of asynchronous
mini batch allocation?
What about data cleaning and pre-processing? Does the same
logic apply?

Also, for Kaggle competitions, how much RAM do you think I


would need? I hear that 32-64 GB is recommended. However, for
image competitions, 128 GB is the minimum. What do you think?

Thanks for being patient with me!

Tim Dettmers says


2018-10-12 at 10:07

I think 32GB is good, but sometimes you need to write


cumbersome memory efficient code — with 64GB you can avoid
that most of the time. 128 GB is a bit overkill for most competitions
I think. If you have 64GB and it is not enough then you can always
spend some time optimizing your code. If you still find your RAM
lacking you can also just order more RAM. I would start with 32 GB
or 64 GB — you can always order more!

A good RAM clock will not help you pre-process much faster. This
video puts it quite well: https://www.youtube.com/watch?
v=D_Yt4vSZKVk

Angel G says
2018-08-03 at 19:35

Hi Tim, I've re-read the comments and a few questions arose in my head.
1. How was it decided that 16-bit floating point numbers have enough precision for neural networks – doesn't it reduce their recognition abilities?
2. In 2010, I trained a C++ coded CNN. I noticed that if I ran it in more than 4 parallel threads, its learning rate decreased (it required more epochs). The weights were updated concurrently by the threads using (mostly) non-blocking atomic cmpxchg64 instructions. I've skipped all development until now. Now massively parallel architectures are used (in GPUs), so I wonder how they update/combine the weights in parallel without destroying the learning rate?
3. Does CUDA vs AMD matter if I implement the neural networks in the old-school manner – without any SDK, using the OpenGL shading language with floating-point textures?

Reply

Tim Dettmers says


2018-08-06 at 17:51

Hi Angel. Here are some answers and suggestions:
1. From my experience, 16-bit results are close to 32-bit results, but I did not do
thorough testing. There is research that points in both directions (in my
research I show that 8-bit gradients and activities work fine for AlexNet on
ImageNet). I think if you want to learn, experiment, prototype and do practical
work, 16-bit networks are alright. For academic work which requires state-of-
the-art results 16-bit methods might have insufficient precision.
2. Again, you might want to read my paper for an overview of synchronous
methods. The best methods compress the gradient and send them across the
network to other GPUs where everything is aggregated and synchronized for a
valid gradient. Such methods work quite well. There are other asynchronous
methods championed by Google, but they also suffer from the problems that
you describe.
3. Implementing in CUDA is much better because they have better tooling, a
better community, and thus better support. However, we need more AMD code
because it is not healthy if NVIDIA has more than 90% of the market share in
deep learning hardware. If you expect that your software will be useful for
many thousand others, AMD might be a good, ethical choice. If it is for a
personal project which may be useful to dozens, go with NVIDIA CUDA.
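A rough illustration of what running a network in 16 bits looks like in PyTorch (a sketch only; serious fp16 training usually also keeps fp32 master weights and scales the loss, which is omitted here):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
x = torch.randn(8, 256)

if torch.cuda.is_available():
    model = model.cuda().half()   # fp16 weights on the GPU
    x = x.cuda().half()           # fp16 activations
# on a CPU-only machine the sketch simply stays in fp32

out = model(x)
print(out.dtype)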

Reply

P says
2018-04-22 at 01:08


Hi Tim, does DL require I/O access to storage (such as an SSD) during the learning process? If so, is it a lot, or only a bit occasionally?

I guess lots of accesses are required at the beginning to load the data, and only a small amount of I/O is required to save the learned weights. Am I correct?

Reply

Tim Dettmers says


2018-04-23 at 18:49

It is mostly loading data and an SSD is only required for performance if you
have very large input sizes. If you do not have that, you will gain no
performance over using a spinning hard disk. However, besides DL you use the
SSD for many other tasks, so if you have the money I would definitely go for
SSD drives to make I/O work more comfortable.

Reply

P says
2018-04-24 at 05:21

Thanks Tim, do DL-related programs require many syscalls or ioctls?

Regardless of cost, is there an advantage to using the i7-8700K rather than the i5-8400? Is the i5-8400 good enough to get things done?

Is it worth paying more to get the Nvidia 1080 rather than the 1070 Ti?

Reply

Leorexij says
2017-12-06 at 16:24

Hi Tim,

This is a great place that collects all the excellent references. It really helps people better configure their machines to perform efficient deep learning.

I'd like to ask a question and would be grateful if you can help me. I am going to use 2000 images at a time, and I want to use the TensorFlow and Theano frameworks in Python. Can you advise me on a configuration to achieve this with good performance? My budget is less than 50,000 INR.
Reply

Abdelrahman says
2017-11-25 at 22:13

Hi Tim,

Thanks for the great article. I am planning to use 6850k for my deep learning box,
with 1070 Ti, 1080, or 1080 Ti GPUs, planning to extend to 4 GPUs later.

I just wonder if the following motherboard is a great option for deep learning box
(4 GPUs): MSI Extreme Gaming Intel X99 LGA 2011 DDR4 USB 3.1 Extended ATX
Motherboard (X99A GODLIKE Gaming )

https://www.amazon.com/MSI-Extended-Motherboard-X99A-
GODLIKE/dp/B014VITZPM/ref=cm_cr_arp_d_product_top?ie=UTF8

Reply

Nikolaos Tsarmpopoulos says


2017-11-16 at 20:02

Hi Tim,

On a 4x GTX 1080ti system, with 1x SSD for Windows 10 Pro, and 3x HDDs in RAID 5
for mass storage (encrypted with Bitlocker), in a secure multiuser environment, I’m
looking for an effective approach to separate storage from compute: I currently
have to reconfigure access rights in the RAID volume for the users, every time the
OS is reinstalled (see clean installation after breaking things).

I think it would make sense having a type 1 (bare metal) hypervisor to allow for
Windows and Linux VMs to access the hardware as needed. I’m considering a VM
for NAS, and two more VMs for Windows and Linux.

Do you know if this is possible abdomen, if so, which hypervisor allows for CUDA
from VMs running Linux and Windows to access the GPUs? Is there a particular,
tested software configuration that you can recommend?

Thanks in advance.
Reply

Nikolaos Tsarmpopoulos says


2017-11-16 at 20:05

Autocorrelation modified my question. Here it is :

Do you know if this is possible and, if so, which hypervisor allows for use of
CUDA from VMs running Linux and Windows, to access the GPUs? Is there a
particular, tested software configuration that you can recommend?

Reply

Levent says
2017-11-07 at 10:03

Hi Tim,

Thanks for this excellent article!

We’re trying to configure our new machine and thinking of buying:

ASUS Z10PE-D16 WS as the motherboard. It’s obvious that we can’t fit more than 3
GPUs on this mobo, but what about using ribbon extender cables and hanging the
GPUs, just like “mining rig” people do?

Do you think this will be a good idea, as this mobo has 4 x PCI-e x16, and 2 x E5
26xx will have 80 PCI-e lanes?

Reply

Julien says
2017-10-31 at 09:29

Hey Tim,


Sorry if I missed this point, but suppose I only plan on having 2 GPUs max. Would a
16 PCIe lane CPU work given that each GPU utilizes 8 lanes?
Reply

Tim Dettmers says


2017-11-20 at 18:47

Yes, for 2 GPUs, 16 lanes is plenty!

Reply

Shahab says
2017-10-28 at 18:42

Hi Tim,

Thank you so much for your great article. It really helps people better configure
their machines to perform efficient deep learning.
I’d like to ask a question and would be grateful if you can help me.
At this moment, I have a GTX 1080 Ti with an Intel Core i5 6500 and 8 GB of RAM. My question is: is it worth upgrading the CPU and RAM to a Core i7 7700 (or 6700) and 16 GB, respectively? Would there be any boost at all if I do this upgrade?

Thank you in advance.

Reply

Clément says
2017-10-09 at 12:59

Hello,

I plan to build a machine with Ubuntu for deep learning.
I already have a GTX 1080 Ti.
I'll take 32 GB of RAM.
My question concerns the CPU: I'd like to buy an i5-7600K. Is it OK with the 1080 Ti? Will it work well if I decide to add another 1080 Ti with SLI?


Reply

new_dl_learner says
2017-09-29 at 15:45

Hello Tim, what do you think of using the Threadripper compared with the i7
7700K, i9 7900X or 7940X? Two main concerns are: 1) there are user reviews saying
that under Linux, there are bugs with PCIe. Not sure if I will encounter such bugs if I
install 1-4 Nvidia GPU. 2) Lack of motherboard that supports PCIe 3.0 x16x16x16x16.
Some mentioned that there is no noticeable difference between x8 and x16. I guess
they talked about the frame rate for gaming. Not sure if this conclusion applies to
deep learning. Any idea?

Reply

new_dl_learner says
2017-09-30 at 15:47

Hello, I have spent too much time on hardware selections. It is driving me nuts.
I need some help. My current laptop computer is almost 10 years old. I am
building a desktop replacement. I also want to use it for DL/ML research. As far
as I know, latest CPUs such as the i7 7700K, i9 7900X and Threadripper do not
support PCIe 3.0 16x16x16x16. Motherboards that support such quad PCIe 3.0
at 16x each only support older CPUs with LGA2011-v3 socket or the Xeon E5.
These CPUs are at similar or higher price range than the ones mentioned
above. Moreover, they are running at lower speed. I guess the dilemma is:
Spending more on older technology for quad PCIe 3.0 16x16x16x16 vs.
spending less on the latest CPU that only support two PCIe 3.0 running 16×16.
Any suggestion appreciated. Thanks.

Reply

Nikolaos Tsarmpopoulos says


2017-09-30 at 19:03

Dear new_dl_learner,
The X99 motherboards with PLX chips (for quad PCI-E 3.0 x16) are no longer produced; they are out of stock in most places, hence no dilemma here. As Tim has explained earlier, the CPU is not as important for deep neural networks as the GPU. Also, PCI-E 3.0 x8 is not much worse than x16. If you start with anything less than a quad-GPU setup, by the time you need faster data throughput, new CPUs and GPUs will be available, sporting PCI-E 4.0, so you won't be limited by PCI-E 3.0 x8.
Reply

new_dl_learner says
2017-10-01 at 00:42

Dear Nikolaos,
Thank you for the useful information. My laptop computer is i7
2.66GHz (8GB 1067 MHz DDR3, NVIDIA GeForce GT 330M with 512 MB
GDDR 3). Will that be sufficient for me to use it to learn about Deep
Learning and do some work in this area before PCI-E 4.0 comes out? If
not, what hardware do you recommend during this transition period?

Reply

Nikolaos Tsarmpopoulos says


2017-10-01 at 03:01

You can start experimenting with Deep Learning using the CPU of
your existing laptop and a compatible library. For example,
Tensorflow is available for CPU. Start with the basics.

When you reach the point where you need faster compute
capability, depending on your budget, you can put together a PC.
At that time, you will know what the requirements of the software
and of the models that you are using will be, so choosing the right
hardware will be much easier than it is today for you.

No need to rush it, it doesn’t help building a supercomputer today


if you don’t know how to use it for deep learning. By the time you
need one, new technologies will be available and you’ll be in a
better position to choose wisely.

new_dl_learner says
2017-10-01 at 14:07

Thanks for your suggestion. When can we build computers using PCI-E 4.0 parts?


Yes, Tim mentioned that CPU is not that important but it was in
2015. Not sure about now. Some people mentioned that DL/ML
applications such as Tensorflow take advantage of multi-core,
multi-thread CPU and recommended getting at least 8 cores. My
laptop is showing signs of failing. Besides DL/ML, I also run
engineering applications that would benefit from higher clock
CPUs.

One concern I have is that in the GPU selection thread, somebody


asked about the possibility of doing DL on computer with a Nvidia
750M GPU. Tim mentioned that “So a GDDR5 750M will be
sufficient for running most deep learning models. If you have the
DDR3 version, then it might be too slow for deep learning (smaller
models might take a day; larger models a week or so).” Mine is an
older laptop with GPU using 512MB GDDR3. So, it will take days to
train even smaller models?

Nikolaos Tsarmpopoulos says


2017-10-01 at 18:17

PCIe 4.0 is reportedly due to be rolled out in early 2018. That


means new CPUs, new motherboards, new GPUs. NVIDIA is likely
to lead with the announcement of its volta GPUs, to be seen.

My understanding is that no matter how fast GPU and CPU you


purchase, there will always be a model that takes days to train and
another that’s too big to fit in your GPU’s memory, which is why
nvidia sells Quadro and Tesla.

Hence, I think the bottom line is: use whatever system you can
afford and justify for your education right now. By the time you
need faster hardware, you’ll know what you need, what you need
it for and what options are available at that time.

new_dl_learner says
2017-10-01 at 21:42

Thanks. Given the timeline, I had better save the money rather than build a top-of-the-line 4-GPU system now.


Some mentioned that as deep learning is very computationally


intensive, having a fast system really helps during the learning
stage as a fast system would allow the learner to try out different
parameters and see how the results are affected without waiting
for days. I did research on neural networks when I was a PhD
student. However, I did not learn about deep learning. As it is just
neural networks with more hidden layers, I suppose it will take me
less time to learn it. Given the background, can anybody
recommend hardware to use during this educational transition
stage? How much RAM should I get?

An alternative is to replace my laptop first. I suppose if I get a laptop, I should get a decent but not too expensive one. In deep learning, will the mobile versions of the GTX 970M, 1050, 1050 Ti, 1060, 1070, and 1080 perform much better than my GeForce GT 330M with 512 MB? I know that for a desktop it is better to buy the 1080 Ti. How are the mobile versions of these GPUs ranked among each other? I don't want to buy a high-end laptop when I could use the money to build a high-end workstation later on.

Reagan says
2017-09-29 at 15:33

Tim,

Great blog; this is my second post. You mentioned previously that, of all the characteristics of GPUs, RAM size and bandwidth are the most important. I haven't seen anywhere on your blog where you mention the relationship between CUDA cores and DL speedups. The meat of my question is: what is the real difference between the 1070 and the 1080, as they both have 8GB of GPU RAM? I'm considering buying a 1070 for the cost savings over a 1080 for my toy DL rig.

Reply

Reagan says
2017-09-29 at 17:04


I pretty much answered my own question. The 1080 is only $70 or so more expensive than the 1070 for 500 extra CUDA cores and more bandwidth. I'll go for it.
BUT, my next question is really important: is a quad-channel mobo a must?

Reply

new_dl_learner says
2017-09-28 at 06:07

Hello Tim, is it better to get a CPU which supports AVX-512?

Reply

Tim Dettmers says


2017-09-29 at 14:55

If you do not use GPUs this might be a sensible investment; otherwise, it will not be that important and I would not select a CPU based on this feature alone.

Reply

new_dl_learner says
2017-09-29 at 15:28

Thank you for your expert advice.

Reply

James says
2017-09-24 at 17:44


Hi Tim,

Great info, thanks. I was wondering if you knew about companies that could offer this service? I know NVIDIA used to build dev boxes but stopped. This would help me focus more on dev and not worry too much about building the machine.

Thanks,

Jame

Reply

Tim Dettmers says


2017-09-26 at 10:58

There were some deep learning desktops from other companies, but I cannot
find them on Google. I think some of them might be buried in the comment
section somewhere, try to search that. Other than that, you could also just buy
a high-end gaming PC. Basically, there is no difference between a good deep
learning machine and a high-end gaming machine. So buying a high-end
gaming machine is a perfect choice if you want to avoid building your own
machine. I would still recommend giving building your own machine a shot —
it is much easier than it looks!

Reply

Jame says
2017-09-28 at 23:34

Hi Tim! Thanks for that. I will look around the comments. If I don't find any other solution I will have to start learning… The problem is the size that we are looking at; it is maybe too big/challenging for a single person.

Thanks!

Reply

Reagan says


2017-09-11 at 00:05
Tim,

What do you think about using the Xeon E5-1620 v4 instead of the i7-5930K for a quad-GPU machine? The Xeon is half the price of the i7, also has 40 PCIe lanes, has higher memory bandwidth, and is the same socket type.

Is there something about server chips I’m not seeing that would interfere with me
using this chip on an X99 mobo?

Also, is there a difference between DDR3 and DDR4 RAM that is meaningful to
deep learning?

Great blog!

Reply

Tim Dettmers says


2017-09-12 at 08:18

The Xeon is definitely a better option here. It has less cache and fewer cores,
but this should only have a minor influence. The chip should work normally on
a X99 mobo. For deep learning there is a very minimal difference between
DDR3 and DDR4. Probably the performance difference would be a few percent
which should not be noticeable unless you run the GPUs 24/7. However, if you
want performance for a 4 GPU setup, then the first thing you should look into
is cooling, in particular, liquid cooling. Other factors are insignificant.

Reply

Sumeet Singh says


2017-09-07 at 21:01

Hello Tim,
Thanks for this wonderful resource for deep learning DIYers. Based on this and several other resources on the internet, I have built my first A.I. 'rig', on which I am training an image captioning/transcription/translation neural network – Im2Latex: it converts LaTeX-generated images back into the original LaTeX markup. I have a convnet of about 14M parameters, and my conditioned-attentive LSTM has about 8M parameters. I had been running this on Google Cloud Platform before I built my own rig, and I am happy to report that my rig with one GPU trains almost twice as fast as the instance with one virtual GPU on Google (i.e. half a K80). I think I can make mine a little faster with some BIOS settings – but I am happy with it so far. I am ordering another 1080 Ti soon. Oh yes, I haven't yet overclocked my GPU, but it naturally runs at over 1900 MHz under load with the help of the Nvidia X Server Settings app on Linux (the temperature is 50 degrees C with built-in liquid cooling, with the CPU at 30 degrees C).
In the spirit of giving back to the community, here’s my parts
list: https://pcpartpicker.com/user/Sumeet0/saved/#view=gFbvVn. Also, while I do
have a copy of Windows 10 I decided to use Ubuntu GNOME 16.04 LTS – mostly
because I’m very comfortable with Unix like operating systems since I’ve worked
on/with those for over 20 years. One problem with Linux though is that most
software utilities for overclocking and system monitoring run on Windows. As you –
and other resources on internet – say, the best way to overclock a GPU on linux is
to flash the BIOS. At the least that’s not convenient – especially for a newbie.

Thanks again for this wonderful resource.


Reply

Tim Dettmers says


2017-09-08 at 20:47

Thanks for your feedback — this is very useful for everybody here!

Overclocking does not increase deep learning performance by much, but it


helps to squeeze the last bit of performance out of the GPU. I think you can
expect something between 0-3% performance increase from that. Might sound
like not much, but put into numbers, this means that if your GPU runs 24/7 this
can be up to 40 minutes per day extra compute time.

Reply

Sumeet Singh says


2017-09-09 at 00:29

Thanks. It's good to know that I'm not missing out on too much by not overclocking (and that choosing Linux over Windows 10 didn't cause me a major performance disadvantage – since I could have very easily overclocked on Windows). Now I can focus on training my 23M parameters

Reply


new_dl_learner says
2017-09-04 at 21:27

Thanks. Some sites suggested 8-16GB of RAM, but I found recent posts suggesting 32GB or 64GB. It is also not uncommon to see posts from users using 128 or 256GB of RAM. What is a reasonable amount of RAM for a home computer, above which it would be better to use online computing services from companies?

Reply

new_dl_learner says
2017-09-03 at 16:46

I read that the CPU is not as important as the GPU for DL; one should just make sure the number of CPU cores is 2x the number of GPUs. However, I also read that CPU cores can be assigned to take on part of the ML/DL computation. So, does that mean it is good to have as many cores as I can get?

Reply

Tim Dettmers says


2017-09-04 at 03:31

More cores are always better, but it is also a question of how much you want to pay. I think CPU cores = 2x GPUs might be a bit much for the high range. If you get 3 GPUs, a 4-core is still sufficient. If you have 4 GPUs, a 6-core would also be sufficient. I would, however, not recommend a 2-core for 3 GPUs. 4 cores for a 4-GPU system is borderline: it will be okay if you just run deep learning, but it might become a bottleneck if you run any other application in addition. So choose according to your budget and your needs.

Reply

new_dl_learner says
2017-09-01 at 15:34


Both Intel and AMD announced an overwhelming number of CPUs in August. Which CPU choice would be the best for ML/DL? At first, I considered the Threadripper, but there is no related motherboard that supports running 4 GPUs at x16x16x16x16 at the same time. I will probably get two 1080 Tis, but I may need four later. Is there an advantage in getting a dual-CPU motherboard vs. getting 2 computers?

Reply

Sumeet Singh says


2017-09-07 at 21:46

I recently built my system


(https://pcpartpicker.com/user/Sumeet0/saved/#view=gFbvVn). Firstly, I would
recommend that you run your models on a cloud-platform first in order to get
a sense of what type of hardware you want. For e.g. in my case, the importance
of the CPU speed and RAM size is very minimal. All the work should be done
by your GPU which in my experience was about 60x faster than if I ran my
workload on the CPUs. However, IMO you should test your model out on a
cloud-platform first (and take measurements) to get a sense for yourself.
Second, my first instinct was to go with AMD processors since those are cheaper – but within a day of researching I found out that you can't find a reasonable AMD CPU that will control even 40 PCIE 3.0 x16 lanes. The Intel Xeon E5s (and the new Silvers) will easily control 48 lanes per CPU. So that
settled the debate for me. I also wanted to run my experiments continuously
for several days – even weeks – therefore I preferred to go with server
components. Thirdly, there were very few consumer (Intel chipset X99)
motherboards that would simultaneously run 4 PCIE 3.0 cards at x16 speeds –
and they all cost $600-$1000. Most X99 motherboards that have 4 or more
PCIE x16 slots are only able to drive 2 of those at full x16 speeds. If you add
more cards, their speed will drop down to x8. I found a few server
motherboards (Intel chipset C612) that did drive 4 cards at x16 speed but at
same or cheaper price-point (I ultimately bought an ASRock Rack EP2612 WS
for $400). These three points made me decide to go with a server
motherboard with two sockets, and Intel Xeon E5 CPU (I’m only using one at
this time). I have the option to add upto 4 GPU cards all of which will be driven
at x16 speed with at total of two Xeon E5 CPUs. I can run two cards with one
CPU alone, and that’s what I intend to do in the near-term. If things work out
and if there is a need, I’ll add one more CPU and two more GPU cards.

Since I have my model all setup and running on Google Cloud platform as well
as on my own system, I have a very good comparison of speed and price. I will
recover the price of my rig in 5-10 months if I run with one GPU and 3-6
months if I run with two (and even sooner with 4 GPUs). This is based on
running my computations at least 12 hours a day every day (which reasonable
for my case). I did *not* factor in the fact that mine runs 1.5x to 3x faster than
GCP. I did factor in the cost of electricity though which is 40 cents / kWH for
tier-3 and 27 cents / kWH for tier-2 consumption in my area. The biggest advantage for me though is that psychologically now I can leave my jobs running all the time without worrying about accruing costs (not to mention that
GCP rations the use of resources and you have to convince them sometimes
when you want to increase your resource consumption – which I find very
perplexing). IMO this psychological advantage is a major plus when you’re
experimenting/researching because you can never experiment too much. As a
bonus, my system runs much faster which is priceless since I can iterate much
faster. That said, you should do your own calculations – I may have made
mistakes or overlooked something. Good luck with your project!
Reply

Tim Dettmers says


2017-09-08 at 20:48

This is very good advice and a thorough analysis — thank you for giving
back! This is very valuable and I should incorporate this advice into my
blog post.

Reply

Sumeet Singh says


2017-09-09 at 00:09

Hi Tim,
Glad you found my input useful. I found this entire forum more
informative than anything else out there on the internet from a deep-
learning DIY POV. Thanks again.

Here's some detail about my PSU calculations:


1. After much deliberation I went for a 1000W PSU for driving the 1
CPU + 2 GPU configuration instead of a 850W PSU. The nominal
consumption of this config would be 800 Watts (180 for motherboard
based on ASRock-Rack’s ‘Prime95’ performance-test results which their
wonderful support-rep shared with me, 85W for CPU, 250W per GPU
and 5Ws per fan(6) and 2.5W for liquid cooling pumps(2). The cheaper
option was to go for a 850W power supply which as you can see
would be running at at least 95% load. But since I want to run this
config 24×7, I want to give the PSU more head-room and hence opted
to go for the 1000W PSU – which would still be too loaded for my
comfort at 80% (I’d prefer 70% sustained load). But I’m already quite
over budget, so I’ll stick with it.
2. When/if I add 1 more CPU and 2 more GPUs – they will consume an additional 585W – bringing the total nominal consumption to 1385W. Then I'll buy another ~850W PSU – and run the two PSUs in
parallel. This way I can scale up the system incrementally. I haven’t


tested running with two PSUs yet, but I think it should work (despite
what many people will say on the internet) as long as you choose the
second PSU such that it will be okay with one not connecting the 24-
pin and EPS/ATX12V mobo pins (some may not put any voltage on the
line or suffer poor voltage regulation in this scenario, but I hope that
my seasonic will do fine), one ensures that any given component is
entirely powered by the same PSU and all the PSUs are grounded to
the same ground. So for e.g. ensure that all the motherboard power
sockets (one 24-pin and two 8-pin EPS/ATX12v sockets on my mobo)
are powered by the same PSU. I’ll update this forum when/if I do that.
An additional complication is that PSUs normally don’t start powering
the output lines when you turn on their power switch. They do that
only after they receive a control-signal from the motherboard (which it
does when you hit the start button on the computer-case) which they
receive on two pins on the 20-24 pin ATX power connector. You can
fake that signal to the second PSU by shorting the correct two pins or
you can buy a device that will relay the mobo’s signal to the second
PSU. I haven’t tried this out yet, so don’t know for sure if it will work but
people have done this successfully so I’m hopeful that I’ll be able to
make it work.
Reply

new_dl_learner says
2017-09-08 at 21:01

About getting a dual-CPU motherboard and having each CPU control two GPUs at the same time… I have two questions:

1. If I only have one CPU installed, can the motherboard control 4 GPUs at the same time at 16x16x16x16?

2. In case two CPUs are required to control 4 GPUs at the same time at 16x16x16x16, will DL software such as Tensorflow take care of the parallelism and distribution of the workload across the GPUs automatically?

Reply

Sumeet Singh says


2017-09-09 at 00:24

1. I didn’t find any mobos that will do that – but I didn’t seriously
consider anything that was north of $670. I suspect you could find this


feature in higher-end server motherboards (much more expensive).


You will also need a CPU that will drive at least 64 (i.e. 4*16) PCIe lanes.
I know of high end Intel CPUs that will power 96 PCIE lanes but those
are very expensive – I think the newest Intel Xeon Gold and Platinum
series (maybe some higher-end Silvers too).
2. Regarding whether Tensorflow will drive all 4 GPUs – that’s my
working theory. I’m training on tensorflow and everything I’ve read
and discussed with colleagues who work on DL tells me that that
should work out of the box. I know from experience that if you train on
CPUs, then tensorflow loads them all up (e.g. it would load all 32 cores
on a 4 socket / 32-core HP server machine). I have read that it will do
the same with GPUs. That’s my main reason for trying to build a 4GPU
system as against two 2-GPU systems. I just ordered my second GPU
and will know in 2 weeks whether this theory pans out. Hope to
update this forum then but I suspect that Tim will already know the
answer right away – therefore do ask him.
Reply

new_dl_learner says
2017-09-09 at 00:44

Thanks.

I guess one uncertainty is whether the two CPUs on the same motherboard coordinate with each other well and automatically. Suppose that CPU1 provides x16x16 to GPU1 and GPU2 while CPU2 provides x16x16 to GPU3 and GPU4. When the user launches a DL application from CPU1, will that application automatically distribute the tasks to GPU1, 2, 3, and 4 without any effort from the user?

Sumeet Singh says


2017-09-09 at 01:00

I think Tensorflow should be able to do that. I’ll update the forum


when/if I have any real experience in this regard.

Nikolaos Tsarmpopoulos says


2017-09-09 at 01:19
Latest (Purley), high-end Xeons support 48 lanes per CPU, hence
96 lanes would only be supported from a dual CPU config. These
CPUs feature up to 3 Ultra Path Interconnect links, for CPU-to-CPU
communication, at 9.6 and 10.4 GT/s.
Source:
https://software.intel.com/en-us/articles/intel-xeon-processor-
scalable-family-technical-overview

That is expected to be similar in performance with the older


interconnect, QPI. Source:
https://www.nextplatform.com/2015/05/26/intel-lets-slip-
broadwell-skylake-xeon-chip-specs/

More information on QuickPath Interconnect and its throughput (roughly 25 GB/s), here:
https://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect

I haven’t been able to confirm whether, in a dual CPU config,


Windows:

1) recognises that a particular process or thread utilises a particular


GPU that is attached to the PCIe lanes of a particular CPU,
2) assigns the aforementioned thread to the CPU where the GPU is
attached

If someone can help find information on this, please share it here.

If this is not feasible, then it's possible that dual CPU configurations
may be slower than a single CPU motherboard that utilises a PCIe
switch (PLX).

Michael says
2017-09-09 at 01:34

***will DL software such as Tensorflow take care of the parallelism


and distribution of workload of the GPUs automatically?***

Of course not. You have to manually design your code to run on


different GPUs. Usually you would break up your data batches and
assign them to individual GPUs in a “multi-tower” fashion. If you
don’t understand these things, I recommend learning them before
spending any money on hardware. It’s like wondering how big of
an engine to get in a car before you learned how to drive. Just get
started with whatever hardware you have at hand. By the time you

know how to build large complex models, any specific hardware


suggestions given here will likely become obsolete.
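
For concreteness, a minimal sketch of that "multi-tower" pattern in TensorFlow 1.x looks roughly like the following; the tiny dense model and the batch shapes are placeholders for illustration, not code from any tutorial:

    import tensorflow as tf

    def tower_loss(x, y, reuse):
        # one copy ("tower") of the model; weights are shared across towers via reuse
        with tf.variable_scope("model", reuse=reuse):
            logits = tf.layers.dense(x, 10)  # stand-in for a real network
            return tf.reduce_mean(
                tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))

    x = tf.placeholder(tf.float32, [None, 784])
    y = tf.placeholder(tf.int64, [None])
    x_parts, y_parts = tf.split(x, 2), tf.split(y, 2)   # break the batch in two
    losses = []
    for i in range(2):
        with tf.device("/gpu:%d" % i):                  # one tower per GPU
            losses.append(tower_loss(x_parts[i], y_parts[i], reuse=(i > 0)))
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(tf.add_n(losses) / 2.0)

Whether this is worth the extra code is exactly the point above: it is not automatic, but once you understand the pattern it is not a huge amount of work either.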

Sumeet Singh says


2017-09-09 at 01:50

Very good point Michael. I (incorrectly) assumed that the


questioner already knew about building towers (as in the CIFAR-10
tutorial https://www.tensorflow.org/tutorials/deep_cnn) and that his
question was in the context of whether tensorflow would distribute
the load after that. But you’re right, this is perhaps not what he
meant.

new_dl_learner says
2017-09-29 at 04:31

Hello Sumeet, any update?

Sumeet Singh says


2017-09-29 at 06:37

Yeah, my dual-GPU configuration has been up and running for a


couple of weeks now. It worked exactly as expected with
Tensorflow. I’ve coded a synchronous towers architecture as
described in the CIFAR10 tutorial. I got about 70-80% speed-up
(speed went up from 1.25 s/batch to 0.7s). I suppose I could get
closer to 100% speed-up if I coded an asynchronous architecture –
which I would do if I got more GPUs. Please note that you have to
tell tensorflow to place the operators on a specific GPU – and you
can do so easily by building multiple copies of your graph and
place each on a separate GPU (see the CIFAR10 tutorial). Therefore
from the beginning try to code your graph such that it can be
easily replicated with all subsequent instances reusing variables
created by the first instance. Tensorflow will automatically place the
variables on the CPU (provided you specify soft-placement in the
session config.) and take care of transferring their values to and
from the multiple GPUs. All you need to do is specify device-
placement and Tensorflow will do the rest – i.e. it will deploy your
graph-code to the multiple GPUs and CPU, transfer data back and
forth between GPU and RAM and coordinate the execution of the
entire graph spread-up over CPUs, GPUs and RAM. It takes very
little effort compared to how much work it does. Coding an
asynchronous model, on the other hand, will take a bit more
coding. Oh and one more thing – be sure to use queues and
queue-runners for reading data asynchronously from the disks so
that the data is ready in RAM when the graph needs it. You’ll also
need to ensure that your BIOS is setup properly – for e.g. I had to
turn on the ‘Above 4G Decoding’ option on my motherboard. I
also noticed that if I turn off ECC, then the speed actually slows
down, contrary to what I had expected. I also noticed a 'warm-up'
period of about 30-45 minutes after boot-up when the graph runs
3x slower – not sure why (maybe the time it takes for OS to load
inodes into cache?) but now I just suspend the machine instead of
shutting it down.
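
To make the placement/config part of that concrete, a skeletal version of what I mean (the variable name and shape are just placeholders) is roughly:

    import tensorflow as tf

    # keep the shared variables on the CPU so every GPU tower reuses the same copy
    with tf.device("/cpu:0"):
        w = tf.get_variable("w", shape=[784, 10])

    # allow_soft_placement lets TensorFlow fall back to the CPU for ops that
    # cannot or should not run on the GPU you pinned them to
    config = tf.ConfigProto(allow_soft_placement=True)
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())

The tower replication itself is done with tf.device("/gpu:0"), tf.device("/gpu:1"), and so on, as in the CIFAR-10 tutorial.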

Hope this helps.

Tim Dettmers says


2017-09-29 at 15:02

Thanks for your comment, Sumeet, it is good to see some


discussion in the comment section!

James says
2017-11-09 at 21:04

Hi Tim,

After some digging I came across this company that could help
me – Elysian ai. Do you know them?

http://www.elysian.ai

new_dl_learner says
2017-08-27 at 02:51

I cannot find a motherboard that supports threadripper and 4 x PCIe 3.0


x16/x16/x16/x16. How come such a motherboard is not available?

Reply

Leo says
2017-08-28 at 15:07

You never will. Threadripper has 64 PCIe lanes, but you have to leave 4 of them
for the chipset, and most mobos now will feed 4, 8 or 12 lanes to NVMe/SSDs and
other disks.

Reply

Tim Dettmers says


2017-08-31 at 01:56

I checked NewEgg and indeed the X399 board’s specs show that they only
support standard PCIe setups. However, if you look at the manufacturer’s
homepage you will see that they indeed support the full 64 PCIe lanes. I assume
Newegg's system is not updated yet to show a 16x/16x/16x/16x configuration in
the specs (the listing format seems to be standardized). See for
example https://www.gigabyte.com/Motherboard/X399-AORUS-Gaming-7-rev-
10#kf

Reply

new_dl_learner says
2017-08-31 at 17:51

Thanks Tim. I am a bit confused. From the specifications, there is no
mention that it supports 4×16, only 2×16. The online manual states the same
thing. Am I reading it wrong?

1. 2 x PCI Express x16 slots, running at x16 (PCIEX16_1, PCIEX16_2)


2. 2 x PCI Express x16 slots, running at x8 (PCIEX8_1, PCIEX8_2)
(The PCIEX16 and PCIEX8 slots conform to PCI Express 3.0 standard.)
3. 1 x PCI Express x16 slot, running at x4 (PCIEX4)
(The PCIEX4 slot conforms to PCI Express 2.0 standard.)
Reply

Tim Dettmers says


2017-08-31 at 18:16

You are right, I was just searching for "lanes" and confused the 64 that I saw
with the specs for the motherboard. This is strange indeed; why do they
not support the full 64 lanes? There was another blog post saying that
this particular board would support that, but the manufacturer's page
clearly says it does not. I would get in touch with the manufacturer and
just ask.

Reply

new_dl_learner says
2017-08-31 at 18:23

As far as I know, even though Threadripper supports 64 lanes, none of the
available motherboards allows 4 GPUs running at x16/x16/x16/x16 at the
same time. So, if I want more than two GPUs running at x16, will I
need to choose an Intel CPU?

Nikolaos Tsarmpopoulos says


2017-08-31 at 20:29

The following article suggests that only 56 out of 64 PCI-E lanes


can be used for GPUs:

http://www.guru3d.com/articles-pages/amd-ryzen-threadripper-
1920x-review,4.html

I haven’t found a good explanation but I think it’s likely 4 lanes are
used to connect the cpu to the X399 chipset.

PCIE switches could be used (in future workstation motherboards)


to support more (than 3) GPUs at x16 lanes.


Muhammad Abdullah says


2017-08-24 at 10:17

Hi Tim, I’m new to Deep Learning and Computer Vision and I need to build a
workstation for that within a $1000 budget and I'll be considering used and low-cost
components available in Pakistan. So far I have found the following options.

Board + Processor + Casing + Power Supply


Dell T3610 with Xeon E5 Series CPU (12 Core, 35M or 30M Cache, 2.0 GHz) $570
Asus X99 Motherboard with Core i7-5820K (6 Core, 15M Cache, 3.3 GHz) $237 +
$332 +$10 +$42= $621
MSi X99S SLi Plus with Xeon E5-2620 v4 (8 Core, 20M, 2.1 GHz) $801 + $20 + $42=
$863
HP Tower z820 with Intel Xeon E5-2687W (8 Core, 20M, 3.1 GHz) $550
HP Tower z620 with Intel Xeon E5-2650 (8 Core, 20M, 2.0 GHZ) $431

GPUs
GTX 1050Ti – $185 – 2GB
GTX 1060 – $400
GTX 1070 – $512
GTX 1080 – $711
Other Options Include Quadro 5000 with 2.5 GB and 382 bit

RAM
16GB DDR4 $142
16 GB DDR3 $33

HDD
500 GB $19
1 TB $33

SSD
128 GB $28

Please guide me about the most powerful and lowest-cost combination that will help
me in future. Also let me know if a better combination of motherboard and
processor can be made from available parts.
Reply

Tim Dettmers says


2017-09-01 at 16:22

You can save money by using the DDR3 ram option with a suitable
motherboard. The cheap E5 options look quite good to me. I would go for a
GTX 1070 given these prices. If you are short on money, a GTX 1060 with 6GB of
RAM would also be okay. Hope that helps!

Reply

Fernando says
2017-08-12 at 20:06

Hi Tim,
I followed your guide to understand better my needs in the computer I want to
build for deep learning applications. However I have a question regarding the PCIes
from the CPU.

Specifically, you mention that 40 PCIe are good to go for 4 GPUs, and also
mentioned that every GPU communicates through 16 PCIe. In my mind if I would
like to use the full potential of the GPUs I calculate I would need 16×4 = 64 PCIe in
my CPU to make this communication efficient. I definitely misunderstood something
about that, but I would love to know how you came to this conclusion. So the
question basically is: how many PCIe lanes does a CPU need? Do I need more than the
ones that my GPUs demand? Is there any other component demanding these buses
and therefore making it necessary to have even more?

Thanks in advance for the information.


Best Regards,
Fernando
Reply

Tim Dettmers says


2017-08-13 at 20:41

Generally, the more the better. While PCIe speed is not that important if you
only do parallelism among 4 GPUs, it is still the easiest factor for improving (or
degrading, if you do not have the lanes) your performance. Generally, only
devices that are attached to the PCIe bus draw lanes. For example, if you
have a PCIe SSD, this will also affect the transfer speed to your GPUs. The setup
in which your PCIe devices can run is specified by the motherboard. For
example you might have a 40 lane CPU, but your motherboard only supports a
8x/8x/8x/8x setup for your PCIe devices, in this case GPUs, so that no GPU can
utilize the full 16x speed.

In my research on GPU parallelism, it is usually the case that networking


performance is the greatest bottleneck. So if you have few lanes, your
algorithms are limited by how much they can scale. There are algorithms which
go around this, but currently only Microsoft CNTK supports those algorithms
(block momentum parallelism). So in general, having full 40 lanes (or rather 36
because one GPU with 16x speed is useless for deep learning if the other GPUs
do not have 16x) is a good thing to have. On the other hand, such CPUs and
motherboard that support 40 lanes are more expensive. From a cost efficiency
perspective it might be better to go with fewer lanes — you might just get
more bang for the buck.

The details are more complicated, but I hope this helps you to get an overview
of the issue.

Reply

Nikolaos Tsarmpopoulos says


2017-08-14 at 01:54

Fernando,
Also check if a PCIe switch (PLX) makes sense for the types of workloads you
will be creating.

Some motherboards feature PLX chips, to allow for 4 GPUs operating at 16


lanes each. Typically, a PLX chip connects to the CPU using 16 lanes and
handles 2 GPUs using 32 lanes. The paired cards can communicate with each
other (DMA) via the PLX chip, i.e without consuming any PCIe lanes on the CPU
and without being affected by any communication of the other pair of GPUs.

Also, note that PCIe uses separate lanes for downlink and separate for uplink,
i.e a device that supports 16 lanes, practically supports 16 lanes uplink and 16
lanes downlink, which can be used concurrently at full speed. This is beneficial,
when the software library uses the following approach:

If the workload can be split into four processing stages that take about the
same processing time and each stage can be handled by a separate GPU,
here’s how the data would be transferred at full speed: GPU1, GPU2 are
attached to PLX1 and GPU3, GPU4 are attached to PLX2. The CPU uses 16
(uplink) lanes to send data to GPU1 via PLX1. At the same time (in parallel),
GPU1 transfers the data it has just processed to GPU2 using 16 lanes via PLX1,
GPU2 transfers data it has processed to the CPU using 16 (downlink) lanes via
PLX1, the CPU transfers this data to GPU3 using 16 (uplink) lanes via PLX2 and,
similarly, GPU3 transfers data it has processed to GPU4 using 16 lanes via PLX2;
GPU4 transfers the data it has processed back to the CPU using 16 (downlink)
lanes via PLX2. You’ll notice that the GPUs make use of all 16 PCIe lanes
available to each of them and the CPU also makes full use of 32 lanes (on both
directions, up link and downlink).

In other words, your software can potentially make optimal use of 16-lane
GPUs, via a CPU with 32 available PCIe lanes, if it only needs to send data from
the CPU to the first GPU and receive data concurrently from the last GPU (in a
sequence of GPUs, where each GPU does some processing and forwards the
data to the next, for further processing) back to the CPU. The workload needs
to be balanced, so that GPUs don’t wait too long.
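
On the software side, the simplest expression of that kind of staged split (ignoring the overlap of transfers and compute described above) is plain model parallelism; a rough PyTorch sketch with made-up layer sizes:

    import torch
    import torch.nn as nn

    # two stages of a model, each pinned to its own GPU
    stage1 = nn.Linear(1024, 1024).to("cuda:0")
    stage2 = nn.Linear(1024, 10).to("cuda:1")

    x = torch.randn(64, 1024, device="cuda:0")
    h = stage1(x)        # computed on GPU 0
    h = h.to("cuda:1")   # transferred over PCIe (possibly through a PLX switch) to GPU 1
    out = stage2(h)      # computed on GPU 1

In practice you would also overlap the transfer of one batch with the computation of the next, which is where the full-bandwidth uplink/downlink paths described above start to matter.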

I’m aware of two EATX motherboards with this feature and they are quite
expensive and I’m not sure if the additional cost can be justified in terms of
performance:

ASRock X99 WS-E/10G


ASUS X99-E WS

Regards
Reply

new_dl_learner says
2017-08-14 at 13:32

When looking for a motherboard, do I need to ensure something like:


NVIDIA Quad SLI, 4-Way SLI, 3-Way SLI, SLI Technology

if I plan to use more than one 1080Ti in the same machine?

Reply

Nikolaos Tsarmpopoulos says


2017-08-14 at 17:02

You will not be using SLI for deep learning.

I suggest that you ensure the motherboard has enough PCI-E x16 slots
for future expansion (as Tim has advised) and, if you are concerned
about the number of lanes that will be available in a multi-GPU setup,
you will need to download the manual (in pdf ) of the motherboard
you’re interested to buy and check the number of lanes according to
the number of GPUs.

Without a PLX chip, depending on CPU lanes, the manual could say for
example 2 GPUs at 16/16, 3 GPUs at 16/8/8, 4 GPUs at 8/8/8/8. A
motherboard with PLX typically says 2 GPUs at 16/16, 3 GPUs at
16/16/16, 4 GPUs at 16/16/16/16.

Reply

new_dl_learner says
2017-08-14 at 23:25

Thanks. I heard that although the Threadrippers have lower CPU


clock than the best Intel CPU, they support 64 PCI-E lanes and
Quad-Channel DDR4 memory. Does that mean the TR4 socket
motherboards would support running 4 Nvidia GPUs at the same
time?

About using more than one GPU for DL, it seems that I need to
write software to take advantage of parallelism. Isn’t the use of
multiple GPUs to solve problems automatic? I mean, when more
than one GPU is installed, do the hardware and software (e.g.
tensorflow) automatically detect the existence of multiple GPUs
and divide the task among all the installed GPUs?

Nikolaos Tsarmpopoulos says


2017-08-15 at 02:26

Indeed, Ryzen Threadripper comes with 64 lanes but my


understanding is that some of these are reserved for other
motherboard features, e.g. chipset, M.2 slots, etc. Check the
following, as an example :

“4 x PCIe 3.0 x16 (single@x16, dual@x16/x16, triple@x16/x16/x8


mode)”
Source : https://www.asus.com/Motherboards/ROG-Strix-X399-E-
Gaming/specifications/

It advertises 4 x PCIe 3.0 x16 but then explains that the third GPU
will be operating in x8 mode. Check the detailed specs in the
manual before you buy a motherboard.

More lanes on the CPU is a good thing, but I suspect a single


threadripper will support up to 3 GPUs at x16 and the server
equivalent will support up to 6 GPUs at x16.

Parallelism is implemented differently in each library. Read through


Tim’s articles, he has provided some very helpful info in his blog.
Also, check each library for updates, as they are gradually
improved by their respective authors.

new_dl_learner says
2017-08-18 at 17:29

I asked Asus. Their reply is:

” I understand that want to the specifications of ROG Zenith


Extreme and ROG Rampage VI Extreme. I know that as a computer
user you want to customize your motherboard on your own
preference to use it to its full potential. Let me continue assisting
you with your concern.

For the two motherboard, it can support 4 x PCIe 3.0 x16 (x16,
x16/x16, x16/x0/x16/x8, or x16/x8/x8/x8) at the same time since they
didn’t share any bandwidth with any of the slots in the
motherboard.”

Does that mean for these two motherboards, I can use four 1080Ti
GPUs running at top speed at the same time?

In another reply, support mentioned that even if I added an SSD or
other PCI cards, the four GPUs can still run at top speed.

Nikolaos Tsarmpopoulos says


2017-08-18 at 21:26

My understanding is that “4 x PCIe 3.0 x16 (x16, x16/x16,


x16/x0/x16/x8, or x16/x8/x8/x8)” means:

1 GPU at x16, i.e at full speed


2 GPUs at x16/x16, ie. both cards at full speed
3 GPUs at x16/x16/x8, ie. two cards at full speed, the third at half
speed,
4 GPUs at x16/x8/x8/x8, ie. one card at full speed, three others at
half speed.

Hence, a motherboard with the aforementioned capability does


not support more than 2 GPUs at full speed.

I suggest you go back to the manufacturer and ask them to clarify


their position. You can ask them, for example: if you attach 3 or 4
GPU cards and each card supports x16 bidirectional PCIe lanes, will
the particular motherboard (of interest) allocate, to each card, x16
dedicated bidirectional PCIe lanes, for concurrent transfer of data
between the cards and the CPU?

If the motherboard supports 4 GPUs at full speed, it’s typically


reported as x16/x16/x16/x16.
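
Once the machine is built, one practical way to double-check what you actually got is to ask the driver which link width each GPU negotiated; a small Python wrapper around nvidia-smi (run it while the GPUs are busy, since the link can downshift at idle):

    import subprocess

    # query the PCIe generation and link width each GPU is currently running at
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current",
        "--format=csv",
    ])
    print(out.decode())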

new_dl_learner says
2017-08-19 at 02:02

Thanks Nikolaos. I will ask support as you suggested.

I may be wrong but I get the impression that he might be trying to


hide something. For example, in a previous email, he wrote the
following.

Hmm… What is “For the full speed, it actually depends on how you
will use the 4 ROG-STRIX-GT1080TI-11GB at the same time. “?

He also suggested getting an overclocked version of 1080 Ti. Does


the overclocked version perform noticeably better?
—————-

“The ROG Zenith Extreme again can work with 4 graphics card
since it support multi-GPU and supports 4 way SLI Technology. For
the full speed, it actually depends on how you will use the 4 ROG-
STRIX-GT1080TI-11GB at the same time. I can still recommend the
Zenith if you don’t want to overclock your GPU. Your GPU speed
will not lower down even if you connect an SSD or another
expansion card since there are no bandwidth between the PCIE
slots. For Intel i9 processor I can suggest the ROG Rampage VI
Extreme.”

Nikolaos Tsarmpopoulos says


2017-08-19 at 02:42

Tim has an excellent article on GPUs, where he explains the pros


and cons of different types of GPU coolers. I suggest that you read
patiently Tim’s articles and the responses he has given to other
readers. There’s wealth of information here:
http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/

new_dl_learner says
2017-08-12 at 15:52

Thanks Tim. How does the number of GPU cards scale with performance? For
example, if I have X 1080 Tis installed on the same computer, will it take 1/X of the time
to complete the same task?

Reply

Tim Dettmers says


2017-08-13 at 20:35
Scaling within one computer is usually quite good. It still depends on the task,
but you can expect a scaling factor of 2.5-3.9x for 4 GPUs depending on the
software framework. The main drawback is that you have to add more special
code which handles the parallelism. I recommend PyTorch for these kinds of
tasks.
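
For reference, the minimal form of that extra code in PyTorch is roughly a one-line wrapper around the model (layer sizes below are arbitrary); for the best scaling on 4 GPUs the distributed variant is usually the better choice:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 10))
    model = nn.DataParallel(model).cuda()   # replicate across all visible GPUs

    x = torch.randn(256, 1024).cuda()       # the batch is split across the GPUs
    out = model(x)                          # outputs are gathered back on GPU 0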

Reply

new_dl_learner says
2017-08-06 at 03:04

Hi Tim, I have a PhD in Computer Science but I have not worked on DL before. For
CPU, do you recommend the AMD Threadripper, Xeon or Core i7-7700/7700K? I
plan to buy a 1080 Ti first and if needed, add more later.

Reply

Tim Dettmers says


2017-08-07 at 21:21

Any of the CPUs that you listed is fine for deep learning with multiple GTX 1080
Tis. Choose the CPU according to your additional needs (preprocessing, other
data science applications, other uses for your computer etc).

Reply

new_dl_learner says
2017-08-10 at 20:05

Thanks Tim. As far as I know, the software for my other needs does not take advantage
of multi-core, so a faster CPU is better than having more cores. Does software
related to deep learning take advantage of multi-core and multi-threading? If so,
about how many CPU cores and threads would be advantageous? AMD
and Intel have different system/memory bandwidths. Which would be
better?

Reply

Tim Dettmers says


2017-08-11 at 03:14

Most deep learning libraries make use of a single core or do not use
other cores in full. Thus a CPU with many cores does not have a great
advantage over others.

Reply

new_dl_learner says
2017-08-11 at 14:37

Thanks Tim. Regarding the GTX 1080 Ti, there are several
companies selling cards of different variants using the GTX 1080 Ti,
which brand and variant do you recommend? I plan to buy one
card first and if needed, add more later.

Somewhere I read a recommendation to stay away from the


reference edition, which I read is also called the Founders Edition.
Does an overclocked 1080 Ti perform significantly better? In the
past, overclocked systems tended to fail sooner. Not sure if it is
worth it.

Tim Dettmers says


2017-08-11 at 17:35

I would recommend the cheapest card. The cards are almost the
same. Overclocked cards have almost no benefit for deep learning
(for gaming they do, though). I am not sure about the Founders
Edition — have not heard anything bad about it other than other
cards being cheaper.

Thorsten Beier says


2017-07-20 at 09:19

Hi Tim,
thanks for this great guide!
It helped us to choose a deep learning server for tensorflow. We now use this rack
machine https://www.cadnetwork.de/de/produkte/deep-learning but with four Tesla
P100s instead of GTX 1080 Tis. But I don't know if there is a huge difference between
GTX and Tesla.
I can confirm that 1-3 GPUs are used fully while the fourth GPU delivers only about 40% of
its performance. It could be a limitation of the PCIe bus.

Thanks
Thorsten

Reply

Tim Dettmers says


2017-07-20 at 17:21

Hi Thorsten,
that is interesting. I do not think that the 40% performance comes from PCIe
issues alone, there might be another thing amiss. It cannot be some cooling
issue since then you would see a performance degradation with other GPUs
too. It would be interesting to know the reason for this. Let me know if you
know more!

I am happy that my guide helped you to choose your server! Indeed, Tesla
GPUs are only minimally better than GTX GPUs. The P100 is quite a bit better
than the GTX 1080 Ti, but it also costs disproportionately more. I think GTX 1080
Ti would have been more cost effective, but often these are not available for
servers (NVIDIA has the policy to sell GTX cards only to consumers and Tesla
cards to companies), so overall not a bad choice!

Reply

Johydeep says
2017-07-14 at 20:31

Hi Michael /Tim
I am looking for one deep learning PC and I found this “Intel Core i7-7800X
Processor”
with
Socket LGA 2066
Compatible with Intel® X299 Chipset
6 Cores/12 Threads
Max Number of PCI Express Lanes 28
Intel® Optane™ memory ready and support for Intel® Optane™ SSDs

AND

MSI Performance Gaming Intel X299 LGA 2066 DDR4 USB 3.1 SLI ATX Motherboard
(X299 GAMING PRO CARBON AC)

Is this a good choice for 2 1080 Tis at full speed?

Thanks
Johydeep

Reply

Tim Dettmers says


2017-07-17 at 05:03

It looks reasonable. With 28 lanes you will have a bit slower parallelism, but for
2 GPUs this bottleneck is not too large so you should still be fine; I guess you
could expect a performance decrease of 10-15% for parallelism with 2 GPUs,
which is okay. Otherwise, the specs are quite good for general computation, so
if you want to use your CPU for other data science tasks this is a good choice. If
you want to only do deep learning I might go for a slower CPU which has more
lanes, but your current option is also not too bad.

Reply

Tom says
2017-06-30 at 09:06

Hi Tim,

Thanks for your great blog.

Can I use a "GeForce GTX 1080 Max-Q" laptop for deep learning tasks?
Here is the full description. I need something really portable, but at the same time I need
to be able to train RNN models.

HIDevolution Asus ROG Zephyrus GX501VI-XS74-HID1 Black 15.6″ w/ IC Diamond


Thermal Compound on CPU+GPU – Optimal System Temperatures (FHD/i7-
7700HQ/GTX1080 Max-Q/512G PCIe SSD/16GB RAM)

https://www.amazon.com/HIDevolution-Zephyrus-GX501VI-XS74-HID3-Diamond-
Compound/dp/B0736C1PP5/ref=sr_1_7?ie=UTF8&qid=1498792741&sr=8-
7&keywords=gx501vi&th=1
Reply

Tim Dettmers says


2017-07-05 at 00:52

The GPU in that laptop is quite powerful so you will be able to train RNNs
without any major problems. It also should be quite fast compared to, say, a
GTX 1060 which will be quite a bit slower.

Reply

Tom says
2017-07-07 at 21:16

The CUDA core count is fine, but apart from that everything else is about 30% less compared
to the main GTX 1080.

Will there be any performance issues when training large models?

Reply

Tim Dettmers says


2017-07-08 at 02:09

You can expect the card to be about 30% slower, but that is still pretty
fast compared to other cards. You might need to adapt your models
slightly or use 16-bit precision for very large models, but you should be
able to run everything that is out there.

Reply

Mirela says
2017-06-25 at 17:09

Hi Tim,

Thanks for the thorough write-up, it is truly helpful.


I am in the process of picking parts for a deep learning machine myself, and I have
a focus on graph computation and network analysis.
Do you have any insights on whether an i5 7500 (3.5 GHz) with a 1060 6GB and 32 GB
RAM would do (in a mini-ITX setup), or should I go for a Xeon E3-1225 (3 GHz)
with the same GPU but possibly more RAM (64 GB)?

Thanks for your efforts to share knowledge so far!


Mirela

Reply

Mirela says
2017-06-28 at 14:57

Just a small update on my part:

I assume working with node2vec would fit my context best.


(https://snap.stanford.edu/node2vec/)
Also, I’ve found
https://www.google.nl/url…gL8EYt_ZHetggs1UnH7HU14uA
(link to pdf, via CWI)

(perhaps interesting for you as well of course)

And upon long pondering, I assume the Xeon E5 1620 v4 is a wiser choice
compared to an i5/i7 setup.
Xeon is mentioned here, as well as is graph processing for a similar setup:
https://www.youtube.com/watch?v=875NbdL39A0&feature=youtu.be&t=243
+
https://www.youtube.com/watch?v=875NbdL39A0&feature=youtu.be&t=445

I’ve already invested in a ‘good’ (what budget could hold GPU, namely the
gtx 1060 6gb, and ram would be 16 or 32 gb as well (already useful for R).
But now it seems the Xeon would be the best option.

Regardless, many thanks!


Reply

Tim Dettmers says


2017-07-05 at 00:48

That sounds reasonable. If I were you I would also pay good attention to
the motherboard. If it has extra RAM slots (8 slots) then you can always
increase the RAM size if you need more; in that way you can upgrade your
setup depending on the problem that you are working on.

Reply

Tim Dettmers says


2017-07-05 at 00:44

You might want to go with the 64GB setup depending on what kind of graphs
you will work with. The graph structure can differ greatly and some graphs will
require you to have more than 100GB of RAM while for others it is more
manageable. The CPU is often less important (but still depends on graph and
problem, so check this for problems/graphs you work with). A GTX 1060 might
be a bit slow at times, but often you do not work with the full graphs anyways
because training would take too long. Thus you could also trim down your
graph further and then a GTX 1060 is a solid choice (no large memory required
and good speedup over the CPU).

Reply

Mirela says
2017-07-14 at 14:09

Hi Tim,
Many thanks!
I have bought the components for below listed setup, aiming at having as
much RAM as possible (‘affordable’ :).
– intel xeon e5 1620 v4

– supermicro x10 srl-f


– kensington ddr4 32gb RAM (lrdimm) 2133 mhz
-> I plan to have in total 8 modules of these, in total 256 gb
(it’s even possible to go up 375 based on the cpu, and 1 tb based on the
motherboard)
– 256 gb SSD
– 1 tb HDD
– msi geforce gtx 1060 6gb
– noctua cooling
– lepa power, 800W

Have seen https://event.cwi.nl/grades/2016/00-Leskovec-slides.pdf also,


seems like RAM is very relevant indeed
Looking forward to some graph processing!
And thanks again!
Reply

Felix Dorrek says


2017-06-16 at 12:13

Hi Tim,
many many thanks for your great blog articles, they are a great help!

I have a perhaps slightly off-topic question. Can you recommend any resources to
learn about computer hardware on a conceptual level? So I am not really interested
in the underlying electrical engineering just yet, but about different components
and how they interact. For example I’m interested in how data is being transferred
from memory to GPU-memory in more detail.

Thank you again,


Felix

Reply

Tim Dettmers says


2017-06-16 at 20:08

That is a good question, but unfortunately, I do not have a good answer for
that! I also wanted to learn more about the conceptual side of hardware, but the
resources that I found are often resources from universities and textbooks
which also look at the details. What I found most promising was to just do
google searches for specific questions and try to get informed through multiple
sources of websites. For example googling “cpu to gpu memory transfer” will
yield blog posts, forum questions, presentations on the topic and so forth. With
that, you can get informed about that question. From here you might have new
questions which you can then google. If you do this for a few hours every
week, you will get quite knowledgeable about concepts quite quickly. Hope this
helps!
Tim
Reply

Felix Dorrek says


2017-06-18 at 18:53

Thanks, I’ll do that then.


I guess I was initially hoping that there was a nice resource to simplify the
learning process

Reply

Alderasus Ghaliamus says


2017-06-03 at 08:35

First of all, thank you very much for the comprehensive and deep knowledge you
gave us through the two blogs: the full guide and the GPU-focused one.

I wish that you could answer my question, with advance apologies if my question
is asking the obvious.

I was about to spend around £3800 on a PC (the new ALIENWARE AURORA) which
has two GeForce GTX 1080 Ti, 64GB DDR4 at 2400MHz, and Intel Core i7-7700K
Processor. I was very happy that I could finally decide which PC I should buy for my
PhD research over the next two years. What made me happier was that I was following
your appreciated GPU-focused blog – YES! I have multiple high performance GPUs!

However, something took me to the other blog – this blog – and I read the CPU
advice, ending with the fact that my CPU has only 16 PCIe lanes – not 40 as you
warned. I went back to the first step in my search for a PC, from 4 months earlier.

I did my best again, [focusing only on built PCs by Dell or Lenovo], and I ended
with another ALIENWARE PC – the ALIENWARE AREA-51, which has the same* GPUs
and memory as the first PC in my comment; however, it has a different CPU, the
i7-6850K with 40 PCIe lanes and 3.0 ER. However, the cost went up by more than £700, to
£4500. It is expensive, I could and will afford it for my PhD, but it is expensive.

When I reached such a cost, I remembered two laptops from which I ran away
because of their costs. I said to myself, if I reached £4500 with the PC, why not go with
a laptop for life for one or two more thousand. The laptops are:

(A) ThinkPad P71.


– CPU: Intel Xeon E3-1535M v6 (certainly, back to 16 PCIe lanes).
– Memory: 64GB(16×4) DDR4 2400MHz ECC SoDIMM
– GPU: NVIDIA Quadro P5000 16GB (no two GPUs).
– SSD: 1TB
-HDD: 1TB
-Cost: £6200

(B) Dell Precision 7720:


– CPU: Intel Xeon E3-1535M v6 (certainly, back to 16 PCIe lanes).
– Memory: 64GB(16×4) DDR4 2400MHz ECC SoDIMM
– GPU: NVIDIA Quadro P5000 16GB (no two GPUs).
– SSD: 1TB
-HDD: 2TB
-Cost: £5800

So, my choices are as following:


Choice One: new ALIENWARE AURORA as PC + XPS 15 £1800 as Laptop = £5400
Choice Two: ALIENWARE AREA-51 + my very old crying coughing laptop = £4500
Choice Three: no PC + ThinkPad P71 = £6200
Choice Four: no PC + Dell Precision 7720 = £5800

If you could please, and really I am so sorry to have you and your appreciated time
reading this long comment, help me with selecting one choice or arranging them
with your reasons, you will make my next two years technically truly safe. I have to
say that my research is on two different data spaces: genetic data and textual data.

Finally, thank you again for your contribution through this blog, and thank you in
advance for getting this point reading my comment.

*To be fair regarding the cost of the second PC, it has 2 more TB HDD [4TB] than
the first PC, however it provides the same size of SSD: 512GB.
Reply

Tim Dettmers says


2017-06-05 at 03:28

These are all solid options albeit all quite expensive. Note that PCIe lanes are
not that important if you have 2 GPUs, but become more important if you have
4 GPUs. However, I do think the biggest issue here is just that these computers

are too expensive. If I were you I would go for a used computer solution which
I would upgrade to your needs.

For example, just last week I sold my used computer, which is similar to, or even
better than, these options, for 800 pounds on Gumtree. So a smart choice might
be to buy a used computer and upgrade it with some parts. For genetics
research I would try to find a cheap computer which supports 8 RAM slots and
then buy 64 GB RAM for the machine and upgrade to 128 GB of RAM if your
research requires this. Speed of the RAM is overvalued; a plain DDR3 RAM
setup is sufficient and cheap. For some deep learning algorithms or algorithms
in computational biology a single GPU should be sufficient but choose one that
has a lot of RAM; a 12GB is ideal and I would go for a used GTX Titan X for
400-500 pounds on eBay (make sure your computer has a PSU which supports at
least 600 watts).

This option would yield a very high performance computer for roughly 2000
pounds. Of course it requires some manual assembly, but it really is not difficult
and you really should try to do this.

If you cannot get a used option with parts due to university bureaucracy I
would go with an ordinary laptop + a hetzner.de GPU machine which for a 3
year PhD will cost 4400 pounds but offers everything that you need and can be
canceled / upgraded month-wise. For most genetics research you should be
fine without a GPU which would cost 2150 pounds on hetzner.de. If your
algorithms require double precision then you will need to make a careful choice
about which GPU to get, but probably the most cost efficient solution would
involve renting some Tesla GPU in the cloud (AWS for example) to work with
double precision when you need it.

So the main options that I see are (1) buying a used computer and upgrading its
parts, (2) buying an ordinary laptop and a dedicated machine in the cloud. These
options will give you the best performance per quid.
Reply

Alderasus Ghaliamus says


2017-06-06 at 14:00

Don’t know how to thank you. Your generosity representing in your reading
time and response is profoundly appreciated.

According to must limitations in buying ‘new’ and ‘high performance’ PC or


laptop, I will take your appreciated advice for the future. Now, I believe that
I will go with the first choice in my list hoping that two GPUs will be enough
most of the time.

Again, thank you very much.

Reply

Raja says
2017-06-02 at 10:17

So I’ve purchased a ryzen 1700x and a msi x370 sli plus


I am wondering if it is going to bottleneck 2x 1080ti for deep learning?
ryzen only has 24 pcie lanes

the board only supports


x8/x8 in pci 3.0
if i understand correctly that is almost the same as x16 pci 2.0

Is x16 pcie 2.0 single gpu a bottleneck?


is x8/x8 pcie 3.0 going to bottleneck 2 gpus?

a 1080ti has 11 Gbps (per-pin) ram

but x8 pcie 3.0 would be only around 8 GB/s while x16 3.0 is around 16 GB/s

Can a modern gpu stream in new data while it is doing calculations?


in which case some of the 11 Gbps is being used for compute and some of it is being
used to stream new data in?

Bit of a deep learning rookie here.


Grateful for any advice.
Also will the dual channel ram in ryzen be a problem?

Should i build a threadripper machine?


As of now I don’t plan to use 4 gpus.
So as long as my current build doesn’t have a major problem i will use it with 2
gpus
and then next year do a 4 gpu build with canonlake/10nm

Reply

Tim Dettmers says


2017-06-02 at 16:21

It depends on the algorithm but in general PCIe lanes with 2 GPUs are not that
important. It will decrease performance but not by a lot. Maybe 0-10%

depending on the use-case. You should be fine.


Reply

Tom says
2017-05-25 at 02:59

What is the best way to put 4 GPUs (NON-founder edition) easily in a board?

Thanks

Reply

Michael says
2017-05-25 at 01:12

i7-7700k costs over $300. That’s not cheap. For that kind of money you can get a
CPU with 40 lanes (e.g. E5-1620v4), and put it into something like ASRock X99
Extreme4 board. Or you could pay more for Asus X99-WS board which has 2 PLX
switches and supports quad PCIe x16.

Reply

Tom says
2017-05-25 at 01:16

Thanks, Michael.

Reply

Sam says
2017-05-25 at 01:54

That’s true, although I’d like to have something newer/faster than ivy bridge as I
do use this machine for more than just deep learning. If I really wanted to save
money I could use the supercarrier board with a ~$40 kaby lake G3930.
Reply

Michael says
2017-05-25 at 02:15

E5-1620v4 is Broadwell, this is the latest generation of Xeon architecture. It


has much better memory bandwidth than 7700k.

Reply

Sam says
2017-05-25 at 02:32

Ahh my mistake! And you’re correct about the memory bandwidth,


although how relevant is that considering the DMA bandwidth
bottleneck for CPU memory –> GPU memory transfer?

Reply

Nikolaos Tsarmpopoulos says


2017-05-25 at 03:17

With a 40-lane CPU you can have the CPU concurrently
exchanging data with 2 GPUs at full bandwidth.

With a 16-lane CPU you're limited to either x16 lanes to one GPU
at a time, or x8 lanes for concurrent exchange of data with both
GPUs.

As Tim mentioned earlier, for 2 GPUs you're fine with x8 per GPU.
Having said that, 16 lanes on the CPU may not be sufficient, as
some of these lanes may be reserved by the chipset or other PCIe
devices, e.g. an integrated M.2 slot.

I would opt for a CPU with more PCIe lanes, as Tim and Michael
have advised.

Sam says
2017-05-13 at 18:52

Hi Tim,

Thanks so much for writing all of this up, it’s very informative. I’m currently picking
out parts for a DL machine, and I’m trying to figure out where I may have
bottlenecks.

Your piece on DMI for ram to vram transfer is quite interesting. Most of what I’m
reading emphasizes high pci-e bandwidth. I’m building a dual gpu system, and I’m
wondering if I really need both gpus running at pci 3.0×16, or if x8 is fine for each?
It sounds like the DMA bandwidth could be a problem. I couldn’t find much info on
DMA related to specific chipsets, however you mentioned 12GB/s. Is this bandwidth
the same for different chipsets (I’m comparing z270 to x99). If I’m mostly running
independent models on each GPU, would I see much if any benefit to 2x pci-3 x16,
or would that only really show a big benefit when running the gpu’s in parallel for a
single model? Asynchronous mini-batch allocation is interesting, however I’m not
sure if it’s integrated into all of the newer high-level DL frameworks…

Re: the DMA issue, intel’s new optane drives are routed through PCI, and they can
be used as ram in addition to long term storage. Do you think that these can be
used as a way around the DMA bottleneck??

Reply

Tim Dettmers says


2017-05-15 at 17:52

If you use the right algorithms there will be almost no decrease in performance
if you use x8 for each GPU. Even if you use the “wrong” algorithms,
performance reduction should be minimal for most models since aggregated
transfer-times for 2 GPUs are not that large. The costs increase dramatically as
you add more GPUs though — for a 4 GPU system it is important that you are
on PCIe 3.0 with at least 32 PCIe lanes from your CPU/motherboard.

I would not care too much about DMA. I suppose for most chipsets / CPU
combos it is the same. It might differ a bit here and there, but the performance
difference should be negligible. I recommend using PyTorch for parallelism if
you have a two GPUs and if you have 4+ GPUs I recommend Microsoft’s CNTK.

Reply

Sam says
2017-05-24 at 23:19

Thanks! After days and days of research I just ordered my hardware.

I thought I’d mention, theres a motherboard that I feel is perfect for dual
gpu rigs: The ASRock Z270 supercarrier:

http://www.asrock.com/MB/Intel/Z270%20SuperCarrier/index.asp

This board, like some of the x99 workstation boards has a PLX switch,
allowing dual pci x16 or quad pci x8 on a Z270 board. For dual gpu rigs,
you get the added benefit of being able to run 2 gpus 4 slots apart
(instead of the usual 3 slots on most non-workstation boards). This helps a
lot with cooling since there’s more space between the 2 gpus, especially
with non-reference coolers taking up 2.5-3 slots these days.

Reply

Tom says
2017-05-24 at 23:50

Hi Sam,
What Processor are you using with ASRock Z270 supercarrier?
Intel i7-6850K Processor ???
Does this motherboard support 40 lanes?

Thanks
Tom

Reply

Sam says
2017-05-25 at 00:42

Hey Tom,

I’m using a i7 7700k. The ASRock Z270 Supercarrier is a Z270


board, so it takes LGA1151 chips, which are limited to 16 lanes,
therefore CPU’s for this board only have 16 lanes. Just like how
some of the Asus/ASRock X99 workstation boards have dual PLX
chips that allow 4x pci x16 with a 40 lane X99 cpu, the Z270

Supercarrier allows a CPU with 16 lanes to run 2 GPUs at PCIe x16. Of


course it’s not that simple – I don’t think the performance is
identical when running PLX chips to get “more” pci lanes than your
CPU has, but my understanding is that the GPUs can communicate
with each other at x16. For the same reason that the Nvidia dev boxes use
X99 chips with 40 lanes yet operate 4 cards at x16 through a
workstation board with PLX chips, the Supercarrier lets us run 2
cards at x16 with a 16-lane Z270 CPU. It's much cheaper than an i7-
6850k yet allows for similar GPU bandwidth. I think that Z270 also
has some other nice things that X99 doesn't because it's much
newer, although X99 does support some things that Z270 doesn't,
like quad channel memory. Lastly, having a board with 4 slots
between GPUs is nicer for SLI when you're air cooling.

Michael says
2017-05-24 at 23:52

Dual PCIe x16 should not need any PLX switches. The switch is only
needed when you want to do Quad PCIe x16, which is more lanes than
a single CPU can support.

Reply

Sam says
2017-05-25 at 00:43

Dual PCIe x16 doesn’t need switches when using a 40 lane CPU,
however having a switch on a Z270 board allows me to use a
much cheaper, still very powerful 16 lane CPU with 2 GPUs at x16.

Tom says
2017-05-25 at 00:49

Non-X99 motherboards and a normal processor which does not


have 40 lanes cannot run 2 GPUs at full power, which is 16X;
is that correct?

So in my understanding the ASRock Z270 Supercarrier is perfect
because it allows having 2 GPUs using x16 and 3 M.2 slots.

99.9% of motherboards do not have space to put 4 GPUs (overclocked
ones) without water cooling installed. 4 Founders Edition GPUs can
be installed in a motherboard like the "ASUS LGA2011-v3 Dual 10G LAN
4-Way GPU ATX/CEB Motherboard (X99-E-10G WS)", which provides
full 16x utilization.

Tom says
2017-05-25 at 01:14

Hi sam,
Can we use Z270 board + Intel Boxed Core i7-6850K Processor
together ?

Reply

Tom says
2017-05-25 at 01:17

Thanks, Sam.

Reply

Martin says
2017-05-10 at 13:32

Hi!

First of all a big thanks! You’ve basically created the best resource for deep learning
enthusiasts looking to build their own machine.

My computer is going to be situated close to my bed so one priority is noise. This


brings me to the first issue of whether or not to get a reference (Founders Edition)
1080 Ti, as they generally seem to be louder than their OEM counterparts.
There seems to be a debate around performance and quality between reference
and OEM cards. From what I can gather, most people who are doing long-running
computations rather than gaming, especially on multi GPU setups, favor reference
cards due to their fan design which blows air out the back rather than just circulate
air inside the case. I’m starting with one card and plan to add a second one later,
and I doubt I’ll get more than 2 cards for a while.

For the CPU, I was first looking at Intel i7 6850k, which was the cheapest i7 I could
find that supports 40 lanes. However, Intel Xeon E5-1620 V4 is almost half the price
and also supports 40 lanes. Not sure if the faster i7 is worth the money here?

Lastly, I was thinking about getting a water cooler for the CPU. I’ve read mixed
opinions about water cooling, but I reckon moving air outside of the case should be
a good thing as it allows the GPUs to run at lower temperatures?

Here’s the build: https://pcpartpicker.com/list/gzfgxY

Any suggestions highly appreciated!


Reply

Thomas says
2017-05-06 at 15:50

Hello Tim.
Thank you very much for this blog. You gave me a solid background for my
understanding of the dependency between hardware and deep learning.

I have a question about the bus speed of the CPU. Should that be a concern? As you
wrote, the true bottleneck is between CPU and GPU, and as I understand it, the "Bus
speed" listed on the ark.intel spec sheets refers to that connection.
I have to choose between the E5-262x v4/3 or E5-16xx v4/3. The 262x family has its bus
speed set at 8 GT/s QPI while the 16xx has 5 GT/s (8 GT/s is what PCIe 3.0 offers).
Besides scalability, clock frequency and memory bandwidth, that is the
difference between them, and the only one that matters in deep learning, since all
of them have clock speeds above 2GHz and memory bandwidth over 68 GB/s, and I will
not make use of scalability.

This link refers to the possible comparison of those


models: http://ark.intel.com/compare/92980,92986,92994,92987

Thank you in advance.

Reply

Tim Dettmers says


2017-05-10 at 13:10

It is correct that this is the main bottleneck between the CPU and GPU;
however, only a very tiny amount of time is spent on CPU-GPU interactions
at the memory level compared to the actual GPU computation. It becomes
more relevant if you have multiple GPUs, but for multiple GPUs the main
bottlenecks are somewhere else. Currently a good CPU (in terms of bus speed)
will improve your deep learning performance by about 0-1.5% compared to a
“standard” one and I would not worry about it too much. I think all the CPUs
that you linked are more than fine.

Reply

Adarsh says
2017-05-05 at 13:57

Hey Tim, Nice article. I have a question:


Let's suppose I train a model for a particular task (classify an image into class 1 or 2)
on a particular kind of input (800×400 resolution). I want to choose a GPU with the
MINIMUM number of cores and power that would give me the result in under 100
milliseconds. How can I estimate this without running the model on a GPU? Is there a
relation between the number of cores of the GPU and the performance of a deep learning
model?

Thanks a lot in advance.

Reply

Tim Dettmers says


2017-05-05 at 14:05

The speed of the computational units between different GPUs of the same
series is about the same (NVIDIA Titan Xp modules are not much faster than,
say, GTX 1060 modules), but the reason why bigger cards are faster is that they just
have more modules (called stream multiprocessors or SMs). If your model is
computationally not intensive, then benchmarking some small GPUs and

extrapolating the number of SMs might be a valid option to find which is the
optimal GPU in this case.

For operations which saturate the GPU such as big matrix multiplications or, in
general, convolution this is very difficult to estimate. It sounds like you want to
reduce costs. A good way to do this is also through power efficiency and this is
a very transparent option which can be easily optimized. It also sounds like you
want to reduce latency — this is very difficult to test because computational
graphs differ too widely; the only option that I see is to find people that have
these GPUs and let them run benchmarks on your model. Or otherwise, try to
generalize existing benchmarks for your model.
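
If you do get access to a candidate GPU, measuring the latency directly is straightforward; a rough PyTorch timing sketch (resnet18 here is only a stand-in for the actual classifier) would be:

    import time
    import torch
    import torchvision

    # stand-in model; substitute your own classifier
    model = torchvision.models.resnet18(num_classes=2).cuda().eval()
    x = torch.randn(1, 3, 400, 800).cuda()        # one 800x400 RGB image

    with torch.no_grad():
        for _ in range(10):                       # warm-up runs
            model(x)
        torch.cuda.synchronize()                  # make sure the GPU is idle before timing
        start = time.time()
        for _ in range(100):
            model(x)
        torch.cuda.synchronize()                  # wait for all kernels to finish
    print("avg latency: %.2f ms" % ((time.time() - start) / 100 * 1000))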
Reply

Adarsh says
2017-05-05 at 14:24

Can you give me some insights as to how I can extrapolate the
benchmarks? Let us assume I have a GPU with 4 SMs and 4GB global
memory on the Pascal architecture which gives me an average classification
time of X milliseconds. Theoretically, can I expect a timing of X/2 milliseconds
with a GPU having 8 SMs and 8 GB global memory?

In my experience performance does not vary in a linear fashion. How can I
estimate timings while extrapolating and interpolating the GPU specs (no.
of SMs, global memory, memory bandwidth, etc.)?

Thanks

Reply

Michael says
2017-05-05 at 18:46

Adarsh, it’s hard to give you any advice, because you didn’t tell us
anything about what you’re trying to do exactly: what is your accuracy
target (e.g. on ImageNet)? What is your power budget?

People have run VGG and Inception on an iPhone 6s, with 150-300ms
latency:
http://machinethink.net/blog/convolutional-neural-networks-on-the-
iphone-with-vggnet/

If you have some embedded application in mind, then your best


option is the Jetson TX2 dev kit. It should definitely be able to run the latest
ImageNet networks under 100ms latency.
See this nice paper which tested the previous Jetson kit:
https://arxiv.org/abs/1605.07678
It also provides some insight into the relation between the amount of computation
and accuracy.
Reply

Charles U says
2017-05-01 at 06:52

Hi Tim,

Thanks for this great article, I also read your other one on GPU performance. I’m on
a budget right now so planning on buying a GTX 1060 6G, with the intent of
upgrading in the future.

In this post, you mention your computer should have at least as much RAM as your
GPU. Does that mean it would make more sense for me to buy a 6G RAM
computer to match my card size ? I was originally planning to get a 4G RAM
computer. And in the future, if I get a 1080Ti with 12G, will I have to upgrade my
computer to 12G RAM ?

Thanks

Charles

Reply

Tim Dettmers says


2017-05-02 at 13:38

This requirement is not so strict; I should update my blog post on this. If you
have 4GB RAM you will be able to work with most datasets if you stream your
data, that is, load it in small batches bit-by-bit. If you do this, 4GB will even
suffice for the GTX 1080Ti. You might run into some problems if you run very
large RNNs, but this can be prevented with some code which initializes weights
directly on the GPU rather than on the CPU. You might also run into problems when
you preprocess data, but this too can be managed with some extra code. You
should be fine with 4GB.
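
A common way to do this streaming in Python is to memory-map the dataset on disk and copy out only one batch at a time; the file name, dtype and shape below are made up for illustration:

    import numpy as np

    # memory-map a large raw float32 file instead of loading it into RAM;
    # only the slices you index are actually read from disk
    data = np.memmap("features.bin", dtype="float32", mode="r", shape=(1000000, 784))

    def batches(arr, batch_size=128):
        for i in range(0, arr.shape[0], batch_size):
            yield np.asarray(arr[i:i + batch_size])  # copies just one small batch into RAM

    for batch in batches(data):
        pass  # feed the batch to the GPU here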

Reply

Nikos Tsarmpopoulos says


2017-04-29 at 01:11

Hi Tim,

I’m reading through each and every single article of yours, they are to-the-point
and very helpful for beginners in neural networks, like myself.

Regarding PCIe 3.0, I've noticed that most consumer-grade motherboards fall back
to x8/x8/x16 for three GPUs and x8/x8/x8/x8 for four GPUs. Hence, in a multi-GPU
setup, where a different CPU thread handles each GPU, it's not possible to make use of
the GPUs' x16-lane capability. Notably, PCIe 3.0 x8 has the same theoretical
throughput as PCIe 2.0 x16.

While searching for motherboards with more PCIe lanes, I noticed that some new
consumer-targeted motherboards come with a PEX 8747 Broadcom PCIe bridge.
That's a 48-lane bridge, which is still insufficient for non-synchronised, concurrent
data transfers. Broadcom's top-of-the-line bridge supports 96 lanes (no idea how much
this solution costs): 64 lanes could be used for 4 GPUs and 32 additional lanes for
the CPU, which means GPUs can communicate with each other using the full PCIe
3.0 x16 bandwidth and up to two GPUs can concurrently transfer data from/to the
CPU at full bandwidth.

Have you considered these solutions? Are you aware of motherboards that deliver
sufficiently good value for money, e.g. achieving performance that would costs a
less than alternative solutions, to justify the cost?

Thanks in advance.

Reply

Tim Dettmers says


2017-04-29 at 13:37

Hi Nikos,

thanks for your comment. I also stumbled upon these switches, but in the end
they are probably not so suitable for deep learning. The details are a bit difficult
to understand but let me try to explain: The problem with these solutions is
that they still use the underlying PCIe interface and thus are limited just like

normal PCIe transfers. In most graphics applications you do not have parallel GPU-to-GPU transfers, but rather GPU-to-GPU transfers which are slightly offset in time and also small in size. Under such circumstances you can have clever protocols and extra lanes which feed into the usually attached lanes (which have a hard limit of 16 per GPU) in a safe and secure manner without blocking the channels. In other words, with these switches you can send multiple packets asynchronously and securely, but each GPU still receives one packet at a time; in a normal switch each packet must be scheduled after all other packets on that path have completed, or otherwise one has insecure transfers (which can corrupt the data).

The crux is that in deep learning, or generally in computing, you do many parallel transfers at the very same time and usually these packets are large. With this new fancy switch you can start the transfer of the packets asynchronously, but they will still block each other for access to the GPU (because it takes quite some time to send the full packet). This means that instead of blocking before the transfer you now have blocking during the transfer. I am not sure about the performance in this case, but I could imagine that the performance is the same or even worse than with normal switches.

The reasoning behind these switches is that they trade the synchronization of
the full PCIe path with the synchronization of sub-paths on the PCIe circuit (the
sub-path to the GPU) which increases performance for many applications,
especially graphics applications, but probably not for deep learning.

Hope this helps!


Reply

Nikos Tsarmpopoulos says


2017-05-04 at 01:33

Hi Tim,

Thanks for your response, as always very informative.

I wrote to Broadcom, to ask for additional information regarding their


particular PCIE switches and they came back with very interesting
feedback.

My understanding from their response is that Broadcom’s aforementioned 80- and 96-lane switches should allow for more efficient GPU-to-GPU communication at full x16-lane speed (per pair), compared to the x8 lanes currently supported in triple and quadruple GPU configurations via a 40-lane CPU.

However, they also implied that these [current generation] switches communicate with the CPU via a 16-lane connection, i.e. the CPU cannot establish two 16-lane connections with corresponding GPUs in parallel via the switch. Multicasting data wouldn’t be affected.

I’m looking into reconfirming this with Broadcom, as there might be a benefit in using a motherboard with these switches in multi-GPU configs.

Kind regards,
Nikos
Reply

Nikos Tsarmpopoulos says


2017-05-04 at 02:45

Reading through your response again, I realise that concurrent access to four GPUs from the CPU, for transmission of large amounts of data, via a PCIe switch that features 16 lanes upstream, would result in half the bandwidth of a CPU's 8 lanes to each GPU.

Reply

Tim Dettmers says


2017-05-04 at 18:05

Indeed, these switches can be very complicated and I am not sure about every detail. If you gain a bit more insight into these, it would be great if you could share it here. Thanks!

Reply

Nikos Tsarmpopoulos says


2017-05-14 at 14:53

Hi Tim,

Following up from my previous message, I have now confirmed with Broadcom that, due to a constraint imposed by the PCIe specification, its PCIe switches feature a total of 16 lanes on the upstream port, i.e. to the CPU.


The downstream ports are non blocking, i.e. when using a PCIe switch
of 80 lanes (64 lanes for the GPUs + 16 lanes for the upstream
connection to the CPU), pairs of GPUs can talk to each other directly,
using 16 lanes per pair.

If we use 4 GPUs, a single GPU can broadcast data to the others at full
x16 speed (versus x8 speed if they were attached directly to the CPU’s
PCIe lanes).

Also, the CPU can broadcast data to all four GPUs using the full x16
throughput.

Two pairs of GPUs can exchange data at full x16 speed (again, versus
x8 speed if they were attached directly to the CPU’s PCIe lanes).

The downside is lower concurrent (non-broadcast) data throughput from the CPU to the GPUs, where the 16 lanes will be shared, delivering the equivalent of x4 throughput per GPU (versus x8 speed, if …, as above).

Thus, it really depends on how much unicast data needs to be


transferred -concurrently- between the CPU and the GPUs versus
between the GPUs.

Depending on what proportion of the time it takes to train a deep neural network is spent on (a) unicasting data concurrently from the CPU to the GPUs, (b) broadcasting data from the CPU to the GPUs, (c) unicasting data between pairs of GPUs and (d) broadcasting data from one GPU to the rest, a multi-GPU setup might benefit from a PCIe switch (also called a PLX).

My understanding is that a motherboard with such a switch costs


£200-£300 more than normal motherboards.

What do you think?
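For intuition, a back-of-the-envelope sketch of the lane arithmetic being discussed (Python; ~0.985 GB/s per PCIe 3.0 lane is the usual theoretical figure after 128b/130b encoding):

GB_PER_S_PER_LANE = 0.985  # theoretical PCIe 3.0 throughput per lane

def per_gpu_bandwidth(lanes_shared, gpus_transferring):
    # Lanes are assumed to be shared equally among the GPUs transferring at the same time.
    return lanes_shared * GB_PER_S_PER_LANE / gpus_transferring

# 4 GPUs attached directly to a 40-lane CPU at x8 each (independent links):
print(per_gpu_bandwidth(8, 1))    # ~7.9 GB/s per GPU

# 4 GPUs behind a switch whose single x16 uplink to the CPU is shared by all of them:
print(per_gpu_bandwidth(16, 4))   # ~3.9 GB/s per GPU, i.e. roughly x4 each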


Reply

Tim Dettmers says


2017-05-15 at 18:02

That is a pretty good insight, thanks Nikos!

So one will get improved performance from such a system.


However, for bigger systems it is common to pool the data on the
CPU to perform more complicated, tree-like broadcasts through
the network. I do not think such broadcasts are currently
implemented for GPU memory (this feature might have been
added since the last time I checked, which was more than a year

ago). I think such a motherboard with the 64 GPU lanes would be


optimal in a 4 GPU setup. For a multi-node setup it might still be
helpful, but probably too expensive to justify the costs, and for big
systems reliability is often more important than squeezing out the
last bits of performance.

All of this also depends on the type of algorithm that one uses
though, but it is good to know that these motherboards can
improve performance! Thanks again!

MacMinus says
2017-04-28 at 11:11

Since we are now more than 2 years down the line, and Moore’s law has been
doing its thing, I would be curious about an update to this great piece with the
current HW (e.g. multi-GTX 1080 Ti’s).

Reply

Tim Dettmers says


2017-04-29 at 13:45

The general hardware recommendations did not change very much and I think I would make the same recommendations that are listed here. If you are interested in GPU recommendations you can read my other blog post about GPUs.

Reply

Petra-Kathi says
2017-04-19 at 10:01


Another great thank-you from my side as I (hope I) have gained a lot of insight into
the hardware setup of deep learning workstations!

Presently I am in the position of defining a deep learning workstation for internal


research purposes in a professional environment (i.e.: full-grown Windows IT
environment :-/ with a dedicated server room, requirements to use only certified
hardware and such). From the pure deep learning research approach I would have
opted for a system with 3 or 4 GTX 1080TI cards – well aware of the problem of
parallelization, but at least providing ample computing power for independent
parallel jobs to come to their ends in sensible absolute time slices. But the only
certified offer in the strived-for power range we got was a system with 2 P6000
cards.
As this card is quite new and mentioned here only once and without further discussion with respect to potential deep learning issues, let me please ask for hands-on experience regarding library support, half-precision deep learning performance, and potential further quirks I may not have thought about. I am well aware that the P6000 is sub-optimal cost-wise, but anything not certified is a no-go in this environment.
Your hints are very appreciated!

Reply

Tim Dettmers says


2017-04-21 at 12:31

The P6000 is based on the GP102 chip, which is very similar to the GTX 1080 Ti and Titan X Pascal. The features and performance will be similar to these cards, that is, you will have the usual support of all deep learning libraries and good computational power, but almost no half-precision performance. So with that card you will receive a powerful GPU which you can use in your certified environment. If the cost difference between the P6000 and P100 is slim, you might want to opt for the P100, with which you gain a bit of performance and half-precision computation. However, if the difference is larger, then just go with the P6000.

Reply

Michael says
2017-04-21 at 17:59

I’d definitely recommend going with Quadro GP100 or Tesla P100:


https://exxactcorp.com/index.php/product/prod_detail/2048
https://exxactcorp.com/index.php/product/prod_detail/1662
P100 should provide double the performance of P6000 for deep learning,
with effectively the same or more of half precision memory (24GB or


32GB).
Talk to Exxact folks, I had a very positive experience with them.
Reply

Petra-Kathi says
2017-04-25 at 10:13

Tim and Michael,


thanks for your comments! Sadly, the cost difference seems to be anything but “slim”, so I expect the final system to contain two P6000s, at least for starters. The Exxact product line will certainly be compared to local offerings.

Reply

samihaq says
2017-04-11 at 08:12

Can anyone please look at my almost-final rig and suggest any improvements, or point out any blunder I am about to make, especially any useless spending of money, which I am already very short of. The only aim is to have a solid, reliable rig that can serve 24/7 for a long time for around $2500-2600. Thank you very much. Regards.

https://pcpartpicker.com/list/BqRgHN

Reply

Petra-Kathi says
2017-04-19 at 10:11

Maybe you should consider one or two additional case fans? IIRC the power
supply fan pushes the air out. If you add another out-blowing fan somewhere
at the top and an aspirating one at the bottom this might improve heat
dissipation in 24/7 operation.


Reply

trulia says
2017-04-29 at 08:52

Does the CPU have 40 lanes?

What is the maximum number of GPUs it can support?

Reply

Tim Dettmers says


2017-04-29 at 13:20

A CPU can have between 16 and 40 lanes. Read the specifications of a CPU to see how many lanes that CPU has. Usually you will need at least 8 lanes for a single GPU, but this also depends on your motherboard: the CPU can provide support for the lanes, but they must also be present on the motherboard. A CPU can support a maximum of 4 GPUs.

Reply

Nikos Tsarmpopoulos says


2017-04-29 at 13:34

Would it be possible for a single CPU (assuming enough dual-threaded cores) to handle 8 GPUs via a 96-lane PCIe switch?

Reply

Michael says
2017-04-09 at 22:25


Sami, FYI: https://www.amazon.com/gp/customer-reviews/R19IMBI0BXETDP/ref=cm_cr_arp_d_rvw_ttl?ie=UTF8&ASIN=B00MY3SQ84

Reply

sami haq says


2017-04-10 at 14:35

@Michael. Oh really. I am tired of searching for reviews and looking things up. Can you please look at my build and suggest any improvements, and especially a motherboard for max $300, or give me some direction? Thank you.

Reply

Michael says
2017-04-10 at 17:47

Sami, for my last 3 workstations, I didn’t bother building them myself. I sent
the desired specs to several system builders, and then negotiated the price
down. In the end, I only paid a few hundred bucks more than what it would
cost me to do it myself.
For your budget, I would buy a used computer, 2-3 generations old, and
get a couple of 1080 Ti cards.

Reply

samihaq says
2017-04-11 at 08:07

Thank you for the info. Regards

Reply

samihaq says
2017-04-11 at 08:19

Thank you for the info. EVGA informed me through email that the motherboard has been tested with the Intel Xeon E5-1620 V3 and not the V4. So thanks for informing me about it; I have changed the CPU from V4 to V3, as both are almost the same. Here is the reply from EVGA:

“Hello,
Thank you for the email so, unfortunately, EVGA hasn’t tested the
newer Xeon CPU like the V4. The only tested is the V3 these are only
we have tested that has supported the X99 motherboard with the
latest bios update. I apologize for the inconvenience.

Xeon® E5-1680 V3 3.20 GHz 40 1.14


Xeon® E5-1660 V3 3.00 GHz 40 1.14
Xeon® E5-2695 V3 2.30 GHz 40 1.14
Xeon® E5-2697 V3 2.60 GHz 40 1.14
Xeon® E5-2670 V3 2.30 GHz 40 1.14
Xeon® E5-2660 V3 2.60 GHz 40 1.14
Xeon® E5-2687W V3 3.10 GHz 40 1.14
Xeon® E5-2687W V3 3.10 GHz 40 1.14
Xeon® E5-2685W V3 2.60 GHz 40 1.14
Xeon® E5-1650 V3 3.50 GHz 40 1.14
Xeon® E5-2667 V3 3.20 GHz 40 1.14
Xeon® E5-2630L V3 1.80 GHz 40 1.14
Xeon® E5-2609 V3 1.90 GHz 40 1.14
Xeon® E5-2609 V3 1.90 GHz 40 1.14
Xeon® E5-1620 V3 3.50 GHz 40 1.14
Xeon® E5-2643 V3 3.40 GHz 40 1.14
Xeon® E5-1630 V3 3.70 GHz 40 1.14
Xeon® E5-2603 V3 1.60 GHz 40 1.14
Xeon® E5-2620 V3 2.40 GHz 40 1.14
Xeon® E5-2640 V3 2.60 GHz 40 1.14
Xeon® E5-2623 V3 3.00 GHz 40 1.14
Xeon® E5-2637 V3 3.50 GHz 40 1.14

Regards,
EVGA”
Reply

sami haq says


2017-04-08 at 09:06

Hi, I am into deep learning and currently have a Quadro K5100 GPU with 8GB of memory in a laptop, with compute capability 3.0. I want to build a solid DL rig which can serve me well for at least 4-5 years with a heavy workload. After reading the fantastic blog by Tim, I have selected the following using PCPartPicker. As my budget limit is around $2400-2500 max, I have basically gone for a budget CPU but better GPUs. I would go for one 1080 Ti and at minimum one or at most two 1070s.

Can anyone please look into my build and suggest any improvements? Also, one thing I am confused about is whether I should go for the
Asus X99-DELUXE II ATX LGA2011-3 Motherboard ($394)
or the
Asus X99-A/USB 3.1 ATX LGA2011-3 Motherboard ($228.88).
Is going for the Deluxe motherboard at almost $180 more justified?
Another thing I am confused about is whether the Founders Edition of the GPUs by EVGA or any other vendor is good enough, or whether I should go with a customized version with more fans, which of course will cost more.

My build is
Intel Xeon E5-1620 V4 3.5GHz Quad-Core Processor (40 lanes) 286.99
Cooler Master Hyper 212 EVO 82.9 CFM Sleeve Bearing CPU Cooler $24.88
Asus X99-A/USB 3.1 ATX LGA2011-3 Motherboard $228.88
Crucial Ballistix Sport LT 32GB (2 x 16GB) DDR4-2400 Memory $219.99
Western Digital BLACK SERIES 2TB 3.5″ 7200RPM Internal Hard $122.88
EVGA GeForce GTX 1070 8GB SC GAMING ACX 3.0 $374.00
EVGA GeForce GTX 1080 Ti 11GB Founders Edition $700.00
Corsair Air 540 ATX Mid Tower Case $119.98
EVGA SuperNOVA G2 1300W 80+ Gold Certified Fully-Modular ATX Power Supply
$182.03
Asus DRW-24B1ST/BLK/B/AS DVD/CD Writer

Total $2278

Thanks
Reply

Tim Dettmers says


2017-04-08 at 15:58

Hi Sami,

I do not see why the $180 would be justified; the board adds 1 PCIe slot but gives pretty much the same deep learning performance.

Please note that if you have two GPUs with different chips, for example a GTX 1070 and a GTX 1080, you will not be able to parallelize across them.

Often the coolers on the GPUs are quite similar in performance so that it
should not be a big deal. However, I am not so familiar with the current fan
designs and there might be a fan which is superior to others. I probably would
pay $20-30 if the fan performance is > 33% better, but not more. I do not think


it is worth it at a certain point – better to save that money to buy another GPU
in the future.

Hope this helps


Reply

sami haq says


2017-04-09 at 06:44

Thanks for the useful info. I didn’t know about the parallelization issue with GPUs of different chipsets. As these GPUs are not cheap to get, I was thinking that I will use a 1080 Ti for big networks, while using one or two 1070s for small prototypes, for parameter sweeps, or for checking different options in parallel on a relatively small scale. Do you believe in my logic, or should I prefer parallelization (by having the same GPUs, in which case I can afford at most two 1080 Tis) over my current view?

My second question is regarding Asus motherboards; they all have very bad reviews on newegg.com from verified owners due to dead motherboards etc. As I am not based in the US and am getting these purchases through a US-based friend, I can't avail myself of the warranty. Do you have any personal experience with a motherboard which you can recommend that can serve me 24/7 for a long time, or can you suggest any particular brand, please?

Once again thanks for you info.

Reply

Tim Dettmers says


2017-04-09 at 10:49

Ah, I understand; using a GTX 1080 Ti and a GTX 1070 makes sense if you use them in that way.

There are in general no reliable motherboard manufacturers, but certain specific motherboard versions are more stable than others. I think orienting yourself by Newegg reviews is a good idea. For example, I would not buy the motherboard that you linked due to its bad reviews. However, I am not current on the market situation for motherboards, so you will have to find a good motherboard yourself. Usually, using PCPartPicker, selecting the X-SLI option (where X is how many GPUs you want to have), then sorting by price and picking the first option with good reviews is a sound strategy to find a good motherboard.


Reply

sami haq says


2017-04-09 at 11:41

Thank you Tim. I have selected the EVGA X99 Classified 151-HE-E999-KR LGA 2011-v3 Intel X99 SATA 6Gb/s USB 3.0 Extended ATX Intel Motherboard. It has 5 x PCI Express 3.0 x16, 4-way SLI and great reviews on Amazon as well as Newegg, and it is around $300. I believe EVGA is a good brand and hope it serves me well. For the GPUs, I will consider a similar chipset as you suggested, because in the near future I guess all the libraries like Theano, TensorFlow, PyTorch etc. will have to support parallelization.

Thanks

jennifer lewitz says


2017-04-03 at 09:59

It is truly a nice and helpful piece of information. I’m glad that you shared this useful information with us.
Please keep us informed like this. Thanks for sharing.

Reply

Tim Dettmers says


2017-04-03 at 12:45

Thank you, I am happy that you found the blog post useful

Reply


Umair says
2017-03-30 at 07:50

Hey Tim

Some of the links here direct to an old WordPress blog. Is that content unavailable
now?

Reply

Tim Dettmers says


2017-03-31 at 15:28

All of my content has been moved to this blog, so you should find it here. I was not aware that there were some old dead links in this blog post. Thank you for making me aware of that. I will clean that up in the next few days.

Reply

Nader says
2017-03-19 at 18:39

Please help: do you recommend getting an Alienware laptop with a GTX 1060 for portability, together with an Alienware Graphics Amplifier holding a GTX 1080 Ti for use as a stationary setup?

Reply

Chris says
2017-03-14 at 02:16

Hi,


someone else above had a similar but not exactly the same question, hence I would like to ask for your opinion as well.

I understand that it would be optimal to have a CPU with enough native PCIe lanes to connect to every GPU with 16 lanes. Given that I would like to build a system with not more than two GPUs, I would need 32 lanes for the GPUs to avoid PCIe bottlenecks. Currently that points to socket 2011-3 CPUs (Broadwell) with 40 lanes.

If I were, for reasons of cost, to use a socket 1151 (Kaby Lake) setup with a 16-lane CPU but with a mainboard offering a PLX switch that provides 2 PCIe x16 slots, one question arises: do the GPUs need the whole PCIe bandwidth permanently, forcing the PLX switch to permanently split the x16 bandwidth into x8/x8, or is it more likely that the GPUs transmit in an interleaving manner, with the full x16 bandwidth available to the currently transmitting one? My guess is that the truth is something in between, but I have no exact numbers or benchmarks. Do you have any experience regarding actual bandwidth loss, or suggestions here? Is it beneficial to use a PLX switch in 16-lane-CPU, dual-GPU configurations, or should I definitely go for a 40-lane CPU?

Cheers,

Chris
Reply

Tim Dettmers says


2017-03-19 at 18:02

Hi Chris,

there are some motherboards which support x16 speed when no transfers to the other GPU are executing, but this is rare. In general you will have x8/x8 speed. Check your motherboard specs for this.

I would not worry too much about PCIe lanes. If you want to parallelize across GPUs there will be a performance hit, but you would still get good speedups. If you use good parallelization algorithms, like those provided by Microsoft's CNTK, then you will have no performance hit. If you use the GPUs separately you will see almost no performance hit. So I would just go ahead with that setup. It will probably give you the best bang for the buck.

Reply


Dhaval says
2017-03-12 at 18:44

Should I get the Zotac Nvidia GT 730? I don't have much money and can spend a max of 5000 INR. Any suggestions, sir?

Reply

Tim Dettmers says


2017-03-19 at 18:16

The GT 730 variant with GDDR5 memory is a good choice in that price range.
The DDR3 variant will be much slower so pay attention to the memory. The
memory is just 1 GB in this variant, but if you use 16-bit networks you can do
some experiments with this. If you need to train larger networks then the DDR3
variant with larger memory (up to 4GB) will be a good choice too. You will have
to wait for experiments a bit longer, but you will be able to run most models if
you use 16-bit and you will get a speedup over using the CPU.
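A minimal sketch of the 16-bit idea (PyTorch assumed; whether fp16 also gives a speedup depends on the card, but the memory saving applies either way):

import torch

# Cast the model and the inputs to 16-bit floats to roughly halve memory use.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
).cuda().half()

x = torch.randn(8, 3, 32, 32).cuda().half()
out = model(x)       # activations are stored in fp16 as well
print(out.dtype)     # torch.float16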

Reply

James says
2017-03-09 at 12:28

Yes, it looks like the Titan X and the new GTX 1080 Ti have basically the same specs, but the 1080 Ti is almost half the price:

https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_10_series

I'd nearly ordered a Titan X, only to find them now out of stock at most retailers.

Is there something fundamentally different about the 1080 vs the Titan where deep learning is concerned? Otherwise it looks like you could build a DevBox clone for a decent price.

Reply


Tim Dettmers says


2017-03-19 at 18:22

Definitely go for the GTX 1080 Ti. The 1 GB memory difference is not significant
for most use-cases.

Reply

ElXDi says
2017-03-06 at 22:40

Hi Tim and all Deep Learning guys!


I have an i7-4790K CPU with 32GB of RAM, which should be fine for the beginning. I'm planning to buy a new GPU and I have a few options. The 1060 is a best seller and the best choice in price/performance, I guess. BTW, here is a list of non-reference-design PCBs: http://thepcenthusiast.com/geforce-gtx-1060-compared-asus-evga-msi-gigabtye-zotac/
1. GTX 1060 3GB reference design, 200e
+ really cheap
– poor overclocking results
– just 3GB of VRAM

2. GTX 1060 3GB non-reference design (the PCB is based on the GTX 1080 with better power feed and an 8-pin connector), 250e
+ performance boost of +5%
+ ability to overclock with volt mode
– price
– just 3GB of VRAM

3. GTX 1060 6GB reference design, 260e
+ more CUDA cores
+ more VRAM

4. GTX 1060 6GB non-reference design (the PCB is based on the GTX 1080 with better power feed and an 8-pin connector), 280e – 300e
+ more CUDA cores
+ more VRAM
+ really good overclocking ability (+15%)
– quite expensive
– the price/performance index is not so good any more

5. GTX 980 4Gb used with 1 year warranty 250e


+ more CUDA


+ 5% more performance than 1060 6GB version


– used
– less VRAM
– more power consumption

So what you think about above options? What is more important more VRAM or
CUDA core number or GPU clock speed or VRAM bandwidth?

Thank you for sharing your experience!


Reply

Tim Dettmers says


2017-03-06 at 23:04

If you have missed it you might want to check out my other blog post about
GPU selection: GPU advice. To reiterate the points:
– Bandwidth is the thing that you want to have the most of
– The best GPU in terms of cost/performance is the GTX 1070 (and soon also
the GTX 1080 Ti)
– GPU memory size is important, but for many tasks 8GB is fine. If you want to do computer vision research, get a 12GB GPU.

To answer other questions: CUDA core number and clock speed are not that
important. Overclocking will give you almost no performance increase for deep
learning.

Hope that helps!

Reply

ElXDi says
2017-03-07 at 00:47

Thank you very much for your answer. Your answer really helps me.
As far as I understand, the 3GB model is really useless. So the 1060 6GB is fine for beginning, the 1070 8GB is the minimum for any real project, and a Titan X 12GB is required for something more serious.

Cheers!

Reply

om says

2017-03-07 at 01:27
The Titan X and GTX 1080 Ti have only a 1GB difference in memory but a big difference in price.

Does anyone know why?

http://www.eurogamer.net/articles/digitalfoundry-2017-gtx-1080-ti-finally-revealed

Reply

Ashley says
2017-03-06 at 22:28

@Michael – bummer it’s done.

Reply

Ashley says
2017-03-07 at 00:04

I found this: https://annarbor.craigslist.org/sys/6031393982.html but I am getting a much better rig at half the price.

Reply

Michael says
2017-03-07 at 01:47

@Ashley: I’d probably just get this one (after getting the price down to
$300, or $350, tops):
https://annarbor.craigslist.org/sys/6031436427.html

The advantage is it’s already got 1050 card in it, so you can start doing DL
right away. Later, if you realize you need more power, you can buy 1080 Ti,
and will still be within your $1k budget.


Reply

Ashley says
2017-03-07 at 18:28

Thanks for the advice.

Reply

Ashley says
2017-03-05 at 02:54

Hi,

Complete noobie build here, so all aspects of all things computer are needed. I have been using a laptop till now and I would like to build a reasonably priced PC that can run CNNs. If needed, I can tunnel in from anywhere to work with it.

This is what I have; I would really appreciate any comments. Have I missed anything?

Intel Core i5-6500 3.2GHz Quad-Core Processor


Corsair H60 54.0 CFM Liquid CPU Cooler
MSI B150 PC Mate ATX LGA1151 Motherboard
Kingston HyperX Fury Black 8GB (1 x 8GB) DDR4-2133 Memory
Western Digital Caviar Blue 1TB 3.5″ 7200RPM Internal Hard Drive
Asus GeForce GTX 1060 6GB 6GB Turbo Video Card
Phanteks ECLIPSE P400S TEMPERED GLASS ATX Mid Tower Case
Corsair CXM 550W 80+ Bronze Certified Semi-Modular ATX Power Supply
TP-Link TG-3468 PCI-Express x1 10/100/1000 Mbps Network Adapter
Gigabyte GC-WB867D-I PCI-Express x1 802.11a/b/g/n/ac Wi-Fi Adapter

Thank you!

Reply

Tim Dettmers says



2017-03-06 at 11:21
Looks like a solid build which offers some opportunities for upgrades in the future.

If I were doing more data science I would probably go with a cheap or used DDR3 CPU/RAM combo and buy more RAM (32-64GB); possibly I would swap the GTX 1060 for a GTX 1070 if I had spare money left from switching from DDR4 to DDR3. If I were doing more deep learning I would also go for a DDR3 CPU/RAM combo, possibly buy used hardware, and then buy a GTX 1080 Ti.

This does not mean that your build is bad. Your build is more future-proof; my build would be more “I-want-to-do-things-now”. I guess this depends on taste, but be aware of what you want to buy when you buy hardware. Do you want to buy data science, deep learning, machine learning, Kaggle competitions, or future-proofing? Your build buys a little of all of that and a lot of future-proofing, which can be a very sensible choice.

Reply

Ashley says
2017-03-06 at 15:51

Hi,

You are awesome – thank you for the quick reply (because I have to get
the laptop I am working with back asap)

– I want it for deep learning & machine learning primarily, either at the
workstation or through a laptop that I can tunnel in with when needing a
change of environment.
– I need it to last because I may not have another chance to buy anytime
soon.
– In case this matters? I will be using Linux, probably Ubuntu flavour. It was
challenging installing on ROG – had to use rpm for some reason.

If you don’t mind:


Where do I get used hardware from?
I can’t find the GTX 1080 Ti on PCPartPicker; it looks like it's coming out this week? Would you know the best place to get it?

Reply

Ashley says
2017-03-06 at 16:07


Oh, and should I look at an SSD for the base installation? If so, will I get away with one that is, say, 125GB?

Reply

Ashley says
2017-03-06 at 21:27

Final build – for now:

Intel Core i5-6500 3.2GHz Quad-Core Processor 198.68 (w


shipping)
Corsair H60 54.0 CFM Liquid CPU Cooler 59.99
MSI B150 PC Mate ATX LGA1151 Motherboard 84.78
Kingston HyperX Fury Black 8GB (1 x 8GB) DDR4-2133 Memory
68.99 (I tend to pull out memory and use it for other builds so
kept the newer version)
Western Digital Caviar Blue 1TB 3.5″ 7200RPM Internal Hard Drive
49.99
Gigabyte GeForce GTX 1070 8GB Windforce OC Video Card 369.99
Rosewill GRAM ATX Mid Tower Case 49.99
Rosewill 600W 80+ Bronze Certified Semi-Modular ATX Power
Supply (had to up for the new GPU) 59.99
TP-Link TG-3468 PCI-Express x1 10/100/1000 Mbps Network
Adapter 9.22
D-Link DWA-552 PCI 802.11g/n Wi-Fi Adapter 9.95
Logitech K120 Wired Standard Keyboard 9.00

I am going to use a 32″ TV – hope that doesn’t kill my eyes..


And have a little mouse and speaker.

Total cost $970.57 (using used where possible)

Michael says
2017-03-06 at 22:10

Ashley, no, this is not how I’d spend a thousand bucks if I needed a
cheap machine for DL. Instead of getting all these parts
individually, I’d shop for a decent used desktop, then buy a good
video card separately. For example, something like
this: https://santabarbara.craigslist.org/sys/5992606383.html
Then you will have enough money left for GTX 1080 and more.
The truth is, CPU performance hasn't improved that much in the last 5 years, so for deep learning an old CPU + 1080 will be faster than a new CPU + 1070.

Also, you should get an SSD. Again, an old CPU + SSD will be faster than a new CPU + hard drive.

p.s. and you definitely don’t need a liquid cooler (nor any
overclocking).

Nader says
2017-03-03 at 18:46

Is the new Ryzen 1800X compatible with the GTX 1080 Ti?
Will using Theano as the backend work?
Thank you

Reply

Andrew says
2017-03-05 at 09:04

Ryzen will work perfectly fine with a 1080 TI. However depending on your work
load Ryzen may not be the best option.

The Pro’s and Con’s of Ryzen 1800X is:

Pro: Ryzen has ECC RAM support, which is great for mission-critical situations where data cannot risk being corrupted at any cost. However, if you are mainly doing deep learning then ECC RAM is not really necessary at all, as most deep learning algorithms and AI training etc. can be done in 16-bit or even 8-bit precision (which is something that the Titan X Pascal excels at, actually, which is why something like a Quadro or Tesla isn’t necessary either in most cases).

Pro: Ryzen has 8 cores, which are beneficial if you plan to work with highly multi-threaded programs for video editing, 3D rendering etc., although in many of these cases you are better off using GPU acceleration instead of relying on a CPU, since CUDA acceleration on an Nvidia GPU (especially a Titan X) will be FAR faster than ANY CPU. And again, if you are mostly just doing deep learning, or maybe some PC gaming on the side, and aren’t running programs that need all those extra cores (deep learning only needs 4 cores even for four-way SLI in most cases, as shown in this article), then the extra cores of Ryzen are frankly redundant.

Con: Ryzen is limited to dual-channel RAM. This pretty much cuts your memory bandwidth in half, which CAN affect intensive deep learning work somewhat. It also only supports up to 2666-2900MHz RAM speeds in many cases, which isn’t really a big deal for deep learning but will affect any memory-intensive workstation/professional tasks. It also has a RAM capacity limit of 64GB, compared to the Intel X99 chipset used with CPUs like the i7 6800K etc. that allows for 128GB of QUAD-channel RAM clocked at up to 3600MHz. It’s up to your situation whether you consider that a problem or not.

Con: Ryzen has no overclocking capability to speak of. Nobody has really been able to get ANY Ryzen chip over 4.1GHz, with many even being stuck at 3.9GHz or 4.0GHz (which in the case of the 1800X is literally NO overclock at all, since the 1800X runs at 4GHz out of the box). So if you are using programs that need clock speed, then a faster chip would be beneficial.

So overall, unless you really have a specific need for an 8-core chip, I would say for a deep learning PC, even if you do things like normal web browsing, heavy PC gaming, video streaming/encoding etc., you might be better off getting something like an i7 6800K (which has 6 cores / 12 threads but can hit 4.4GHz in some cases, so overall it is a bit better), which is $100 cheaper than the R7 1800X; or perhaps the i7 7700K, which is only $329 ($170 cheaper than the R7 1800X) and can easily overclock to 5GHz with proper cooling (many people have hit 5.2GHz even with just high-quality Noctua air coolers or AIO water coolers). The only reason I would specifically get Ryzen is if you really need an 8-core chip for specific programs, as deep learning and most general use don't require more than 4 cores.
Reply

Michael says
2017-03-05 at 20:10

Keep in mind that 6800K has only 28 PCIe lanes (Ryzen and 7700k are even
worse), so if you’re planning to use multiple GPUs (now or in the future), go
with E5-1650 v4 (or E5-1620 v4 if you’re on a budget). Also, Skylake Xeons
are about to be released (this month), so if you can, wait for them (mainly
for AVX512 support).

Reply

tom says

2017-03-05 at 23:08
Hi Michael,
I would like to use 4 GTX 1080 Tis.
Which is the best cheap processor and motherboard combination?

Reply

Andrew says
2017-03-06 at 04:22

Ah yes. Good catch.

Have you really noticed a difference between running a GPU with PCIe 3.0 x8 and x16 for deep learning, though? In most other situations I've seen, having x8 PCIe 3.0 isn't hindering much at all, if anything; you sometimes see a 0.5% or maybe 1% performance delta between the two, but that's typically it.

Reply

Michael says
2017-03-06 at 06:02

I haven’t seen this tested anywhere, but I’m guessing it’s important
for large networks running on fast GPUs, when it takes longer to
move gradients from GPU to GPU than to calculate them.

Tim Dettmers says


2017-03-06 at 11:22

I am always happy to answer comments, but it gives me even more joy to see that people answer each other's questions. Thanks Michael and Andrew!


Tim Dettmers says


2017-03-06 at 11:09

Yes, the AMD Ryzen CPU series will be compatible with your NVIDIA cards. In general, all modern CPUs should support NVIDIA cards. This is because the CPU and the NVIDIA GPU communicate via a protocol (PCIe) that is also used for printers, network interfaces and so forth, and no CPU manufacturer can afford not to support it. Thus all CPUs should have support for NVIDIA GPUs (at least those which come as PCIe cards, which is all GPUs except the ones with NVLink, that is, currently the NVIDIA P100).

Reply

s12 says
2017-03-01 at 10:46

Hi Tim,

I have been looking into using NVLink to couple two TXPs. I was hoping to do this in an SLI-like fashion (as shown here: http://www.kitguru.net/components/graphic-cards/anton-shilov/nvidia-pascal-architectures-nvlink-to-enable-8-way-multi-gpu-capability/ ), rather than buying a purpose-built motherboard. Unless I’m mistaken, this isn’t currently possible; do you know if NVIDIA has any plans to implement this in the future?

Thank you very much for this article and all of your helpful comments.

Reply

Tim Dettmers says


2017-03-06 at 10:52

If you are interested in parallelism I recommend looking into Microsoft’s CNTK


library. Their parallelisation algorithms, especially 1-bit quantization and block
momentum, are so good that you get linear speedups without having NVLink.
Granted, the software is a bit difficult to use but is maturing quickly and you


could save a lot of money by going without NVLink. I am currently not aware of
any affordable NVLink hardware which is used outside of supercomputing. You
might get your hands on one of those machines, but it will be expensive. So in
the end CNTK might be the only way to go which is practical. This may be
disappointing, but I hope it helps!
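For intuition only, here is a toy NumPy sketch of the idea behind 1-bit quantization with error feedback (this is not the CNTK API; the function is made up purely to illustrate the principle):

import numpy as np

def one_bit_quantize(grad, error):
    # Add the residual that was left over from the previous step (error feedback).
    corrected = grad + error
    # Represent every value with a single sign bit plus one scale per tensor.
    scale = np.mean(np.abs(corrected))
    quantized = np.where(corrected >= 0.0, scale, -scale)
    # Whatever was lost is carried over to the next iteration instead of being dropped.
    new_error = corrected - quantized
    return quantized, new_error

# Hypothetical usage inside a training loop:
# quantized_grad, residual = one_bit_quantize(grad, residual)  # send quantized_grad over PCIe/network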
Reply

Ervin says
2017-02-04 at 14:42

Hello Tim, and thank you for your post. I currently have a desktop with a Core 2 Quad Q9300. I was wondering whether it would bottleneck a GTX 1060 6GB for beginner- to mid-level DL problems?

Reply

Tim Dettmers says


2017-02-10 at 17:47

It is an old CPU, but you should be relatively fine. You can expect to run about 10-20% slower than with a high-end CPU. Non-deep-learning code, that is, preprocessing data, will probably take quite a bit longer, but running the deep learning model itself should not be much slower.

Reply

Ervin says
2017-02-11 at 22:22

Thank you for your reply. I also have an old motherboard GA-P43-ES3G
(http://www.gigabyte.com/Motherboard/GA-P43-ES3G-rev-10#sp) which
only supports PCI Express 2.0. I believe that will be a major bottleneck
right?

Reply


om says
2017-02-02 at 08:18

Hi Tim,

I have the latest Mac and I want to use a GPU – a GTX Titan X – with the AKiTiO Node eGPU box (Thunderbolt 3): https://www.akitio.com/expansion/node

My question is, can I use TensorFlow with this external GPU device without killing performance and efficiency? What could be the side effects?

I know using the Titan X with a desktop would be a lot better, but I need mobility.

Thanks, and you rock


om

Reply

JP says
2017-02-02 at 15:55

On the akitio specs it says that Mac is not supported.

Reply

om says
2017-02-10 at 03:20

Just wanted to add – AKiTiO reply –

Hello trulia.

You have a new message from .


Re: tensorflow
Message: Hi Trulia, We currently do not have an eGPU solution for the Mac and
the only eGPU solution we have is for select Thunderbolt 3 PCs, so the answer
to your question is No. Having said that, it might be possible that you could
make it work but this would be more of a DIY project that requires hardware
and software modification. Also, it would void the warranty, so this is not
something that I can recommend. Regards, Stefan

Reply

Usher says
2017-01-20 at 11:41

Thanks Tim for your comment!

Reply

Usher says
2017-01-17 at 02:39

Hi Tim,
Could you comment on below build?

– Chassis: Corsair Carbide Air 540


– Motherboard: Asus ROG STRIX X99 GAMING ATX LGA2011-3 Motherboard
– Cpu: Intel Core i7 6800k
– Ram: 32GB DDR4 G.Skill 2400Mhz
– Gpu: 1 ASUS GTX 1080
– HD1: 500GB SSD Samsung EVO
– HD2: 1TB WD Red in RAID 5

I am not sure if the board is a good choice if I might be adding a second GPU in
the future. Or maybe ASUS X99-Deluxe II is worth the extra cost?

Reply

Tim Dettmers says


2017-01-20 at 11:12

Hi Usher,

I do not have time to check the details, but it seems that the motherboard is okay. The reviews on Newegg are not that good, but the cost/performance might still be fine. Adding a second GPU will definitely be no problem with the motherboard that you chose.


Otherwise the build looks okay. I recommend checking the build with
pcpartpicker, which often finds compatibility issues if there are any.
Reply

Pavel says
2017-01-07 at 00:31

Hi Tim,
Very good article! Thank you!
P.S. You have cool working place.

Reply

Nader says
2017-01-06 at 16:16

So a single Titan X Pascal trumps dual GTX 1080s in SLI?

Correct?

Reply

Michael says
2017-01-05 at 22:56

Tim, thanks for the great article! I have a couple of questions:

1. What is “4” in your mini-batch size calculation (4x128x244x244x3)?

2. I’m deciding on which SSD to buy for my machine with four Pascal Titan X cards,
mostly to do training on Imagenet. Assuming your bandwidth estimate of 290MBps
is for a single card, should I multiply it by four when running a model on all four
cards? Do you know how fast Pascal Titan X processes a single 128 mini-batch?
Also, if I use mini-batch of 256 , I would need double the bandwidth, right?
Given the above considerations, would you recommend going with a PCIe based
SSD, such as Samsung 960 Pro, rather than SATA based one, such as Samsung 850
Evo?


Reply

Tim Dettmers says


2017-01-06 at 12:48

1. The data used in deep learning is usually 32-bit, or 4 bytes; this is the 4 in the calculation above (the conversion into bytes).
2. This is a bit complicated. Parallelism does not scale linearly, so you should multiply the estimate by 3.5 or so (for TensorFlow this will be closer to 2.5-3). One thing to keep in mind is that in practice small data transfers are often slower (the overhead is large when the data size is small) and that GPUs operate more efficiently on larger batch sizes.

I am unfamiliar with the exact internals of the TensorFlow batching procedure. If they do it right for both data loading and data transfer, a PCIe SSD would yield a bit of improved performance. However, from some benchmarks it seems that in some parts (GPU transfers, I think) TensorFlow is sub-optimal. If this is really so for GPU transfers, then a PCIe SSD and a normal SSD would yield the same performance. I personally would just go for a cheap normal SSD.
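For reference, a back-of-the-envelope version of this calculation in Python (the per-batch time is an assumed placeholder; measure it for your own model and GPU):

# One fp32 mini-batch with the numbers quoted above: 4 bytes x batch x height x width x channels.
bytes_per_batch = 4 * 128 * 244 * 244 * 3
mb_per_batch = bytes_per_batch / 1024**2
print(round(mb_per_batch, 1))                  # ~87.2 MB per mini-batch

time_per_batch = 0.3                           # seconds; assumed value, measure your own
print(round(mb_per_batch / time_per_batch))    # ~291 MB/s sustained read bandwidth for one GPU

# For four GPUs, multiply by the effective speedup (roughly 2.5-3.5x as noted above);
# doubling the mini-batch size roughly doubles the bandwidth if the time per batch stays similar.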

Reply

Michael says
2017-03-25 at 03:22

Thanks Tim. A different question: which software framework would you use
for experimenting with Imagenet?
So far I’ve been using Theano, but only on small datasets (MNIST and
CIFAR). My main interest is to test different quantization methods for
weights and activations, and see how it works for different network
architectures. I’ve read your paper, by the way, very interesting, but I prefer
not to code everything from scratch in C/CUDA if possible. Right now I’m
looking into implementation of the asynchronous batch allocation, like you
suggested, in Theano, and it’s not very straightforward.
Would you recommend switching to TensorFlow, or sticking with Theano?
I’m less concerned with the ability to parallelize code across multiple GPU,
because I can just run different experiments in parallel.

Reply

Tim Dettmers says


2017-03-26 at 13:54


TensorFlow is a good call. If you want to work on vision only, Caffe is also an excellent option. However, overall PyTorch or Torch might be more suitable for you. PyTorch already implements asynchronous batching by default, and Torch already has the 1-bit quantization method. I am currently not sure how that is integrated into PyTorch, but since both Torch and PyTorch are only wrappers for Lua and Python, respectively, interfacing with 1-bit quantization should be relatively straightforward. If you want to implement other methods of quantization, then Torch, and to some degree PyTorch, offer good interfacing and easy extension. However, the algorithms would need to be written in C/CUDA. Extending TensorFlow in this way might not be as straightforward, so you might run into difficulties either way. TensorFlow is of course still more popular, and thus if you extend it, it will have more value for other people. So not an easy decision, but maybe I have given you some points which make the decision easier.
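A minimal sketch of asynchronous batch loading and transfer (PyTorch assumed; the dataset here is random dummy data, and this only illustrates the loading pattern, not a full training loop): worker processes prepare batches ahead of time, pinned memory enables asynchronous host-to-GPU copies, and non_blocking=True lets the copy overlap with computation.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for a real dataset.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))

# num_workers loads batches in background processes; pin_memory allows async copies.
loader = DataLoader(dataset, batch_size=128, num_workers=2, pin_memory=True)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10)).cuda()

for images, labels in loader:
    # non_blocking=True overlaps the host->GPU transfer with ongoing GPU work.
    images = images.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    loss = torch.nn.functional.cross_entropy(model(images), labels)
    loss.backward()   # gradients only; optimizer step omitted for brevity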

Reply

Nader says
2016-12-30 at 05:14

Hi,
What do you think of the following build ?
https://pcpartpicker.com/list/8sv2jc
Thank you

Reply

Pawel says
2016-12-29 at 12:57

Hi Tim,

I’ve just upgraded from GTX 960 to GTX 1070.


I used to run the TensorFlow cifar10_multi_gpu_train.py file to check the speed from one release to another. With the latest TensorFlow release it peaked at about 1500 images/sec with the GTX 960 (which is impressive progress, btw; with the initial releases it was more like 750 images/sec).

I was surprised to see that my GTX 1070 peaks at ~1700 images/sec, a very small improvement. It looks like the CPU is now the bottleneck (I see it constantly at 300% usage, i.e. 3 full cores). I have an i5-3570K, which should be decent.

I haven't analysed it further (yet), but could somebody share their experience with that? I wasn't expecting the CPU to be the bottleneck here.
Reply

Tim Dettmers says


2017-01-03 at 21:37

Are you training on multiple GPUs (cifar10_multi_gpu_train.py)? If so, then this is


your answer. TensorFlow has terrible performance for multiple GPUs and
upgrading multiple GPUs will not yield much better performance for
TensorFlow.

Reply

Piotr Czapla says


2017-11-07 at 22:36

Paweł,
TensorFlow does not use GPU-to-GPU transfers when updating weights. It downloads the whole model to RAM and makes the updates on the CPU. At least this is the understanding I've got from reading: https://arxiv.org/abs/1608.07249

Reply

Nader says
2016-12-11 at 16:10

Should I buy a GTX 1080 now or wait for the Ti, which is supposedly coming out next month?

Reply

Tim Dettmers says



2016-12-13 at 12:36
The GTX 1080 Ti will be better in every way. Make sure, however, to preorder it or something similar; otherwise all cards might be bought up quickly and you would have to fall back to the GTX 1080. Another strategy might be to wait a bit longer for the GTX 1080 Ti to arrive and then buy a cheap GTX 1080 from eBay. I think these two choices make sense if you can wait for a month or two.

Reply

Nader says
2016-12-30 at 05:16

Hi,
What do you think of the following build ?
https://pcpartpicker.com/list/8sv2jc
Thank you

Reply

Nader says
2016-12-30 at 05:16

Thank you for your reply.


I appreciate it.

Reply

Andrew says
2017-02-10 at 08:44

I would personally recommend a couple of small changes. Here's what I would go with: https://pcpartpicker.com/list/bx2ssJ

First off, if you are going to spend $85 on a 256GB regular SATA-based SSD for storage, then you might as well get the top-of-the-line M.2 960 Evo for $120. It's over 3 times faster than the one you picked in transfer speeds, and is overall much better. (Alternatively, if you don't care about the extra speed, you can get a 500GB SATA drive for about the same price, getting double the storage.)


The second thing I would change is to get a Z270 motherboard rather than a Z170. It's been a month or so since you commented, so I'm not sure if you bought yours yet, but the new Z270 motherboards support more PCIe lanes, support 4K encoding on 7000-series CPUs, etc., so they're worth looking at, especially since they're basically the same price. My link swapped in a Gigabyte Z270 Gaming K3 for your Gigabyte Z170 Gaming M3. Very similar boards.

Lastly, you should also get the i5 7600K instead of the i5 6600K, since Kaby Lake 7000-series processors are about 5-10% faster than Skylake 6000-series processors, and the 7600K can be overclocked to over 5GHz no problem, compared to the 6600K, which has trouble getting over ~4.7GHz on air cooling in some cases. And since the 7600K is also about the same price, you might as well get it. Personally, though, I would still recommend an i7 over an i5 in this situation, simply because simultaneous multi-threading is becoming more important as of late, and the extra 2MB of L3 cache is also nice to have. I figure if you are spending $1200 on a Titan X Pascal you should be able to fit in $100 more for an i7 7700K that can also be overclocked to 5GHz pretty easily in most cases (even on air!).
Reply

Table Salt says


2016-12-08 at 15:56

Hi Tim, thanks for the excellent posts, and keep up the good work.
I am just beginning to experiment with deep learning and I’m interested in
generative models like RNNs (probably models like LSTMs, I think). I can’t spend
more than $2k (maybe up to $2.3k), so I think I will have to go with a 16-lane CPU.
Then I have a choice of either a single Titan X Pascal or two 1080s. (Alternatively, I
could buy a 40-lane CPU, preserving upgradability, but then I could only buy a
single 1080). Do you have any advice specific to RNNs in this situation? Is model
parallelism a viable option for RNNs in general and LSTMs in particular?

Thank you!

Reply

Tim Dettmers says


2016-12-13 at 12:24

I think you can apply 75% of state-of-the-art LSTM models on different tasks with a GTX 1080; for the other 25% you can often create a “smarter” architecture which uses less memory and achieves comparable results. So I think you should go for 16 lanes and two GTX 1080s. Make sure your CPU supports two GPUs in an x8/x8 setting.

Reply

Om says
2016-12-05 at 04:13

Hi Tim,

You are such an amazing person. So patient and knowledgeable.

I am also in the same boat of deep learning and willing to learn.

I bought this computer:

http://www.costco.com/CyberpowerPC-SLC2400C-Desktop—Intel-Core-i7—8GB-NVIDIA-GeForce-GTX-1080-Graphics—Windows-10-Professional.product.100296640.html

CyberpowerPC SLC2400C Desktop – Intel Core i7 – 8GB NVIDIA GeForce GTX 1080 Graphics – Windows 10 Professional

This is a gaming PC, but I don't play games.

My question is: can I use a Titan X Pascal from Nvidia along with the GeForce GTX 1080 for more computation power?

I learned that SLI is not the solution, and anyway both are different GPUs. So, in order to achieve faster results, can I combine both GPUs for TensorFlow?

I am using TensorFlow.
I just found this (Basic Multi GPU Computation in TensorFlow):
https://tensorhub.com/donnemartin/4_multi_gpu

I need to install a VM with Ubuntu 16 for all of this setup.

Thanks

Reply


Tim Dettmers says


2016-12-13 at 12:14

Hi Om,

I am really glad that you found the resources of my website useful — thank you
for your kind words!

The thing with the NVIDIA Titan X (Pascal) and the GTX 1080 is that they use
different chips which cannot communicate in parallel. So you would be unable
to parallelize a model on these two GPUs. However, you would be able to run
different models on each GPU, or you could get another GTX 1080 and
parallelize on those GPUs.
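A minimal sketch of the “different models on each GPU” route (PyTorch assumed; the device index is an assumption, check nvidia-smi for the ordering on your machine): each training process simply pins itself to one card.

import os

# Make only the second card visible to this process (e.g. the GTX 1080);
# run a second copy of the script with "0" to use the other card.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch  # imported after setting the variable so it only sees the selected GPU

device = torch.device("cuda:0")      # index 0 now refers to the chosen card
model = torch.nn.Linear(1024, 10).to(device)
x = torch.randn(32, 1024, device=device)
print(model(x).shape)                # torch.Size([32, 10])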

Note that using an Ubuntu VM can cause some problems with GPU support. The last time I checked, it was hardly possible to get GPU acceleration running through a VM, but things might have changed since then. So I urge you to check whether this is possible first, before you go down this route.

Best,
Tim

Reply

Gordon says
2016-12-01 at 13:21

Thank you very much for writing this! – knowing something about how to evaluate
the hardware is something I have been struggling to get my head around.

I have been playing with TensorFlow on the CPU on a pretty nice laptop (fast i7 with
lots of RAM and an SSD but ultimately dual core so slow as hell).

I want to try something on the GPU to see if it is really 100s of times faster, but I am worried about investing too much too soon as I have not had a desktop in ages. Having read this post and the comments, I have the following plan:

Use an existing FreeNAS server I have as a test bed and buy a relatively low-end GPU, a GTX 960 4096MB:


https://www.overclockers.co.uk/msi-geforce-gtx-960-4096mb-gddr5-pci-express-graphics-card-gtx-960-4gd5t-oc-gx-319-ms.html

The FreeNAS box has a crappy Celeron Core 2, a 3.2GHz dual core, with only 8GB of RAM:

http://ark.intel.com/products/53418/Intel-Celeron-Processor-G550-2M-Cache-2_60-GHz

I will buy the graphics card and an SSD to install an alternative OS on. I *may* upgrade the RAM and processor too, as all of these items will benefit the FreeNAS box anyway (I also run Plex on it).

If this goes well and I develop further, I will look at a whole new setup later with an appropriate motherboard, CPU, etc., but in the meantime I can learn how to identify where my specific bottlenecks are likely to be.

From what you have said here i think there will be several slow parts to my system
but I am probably going to get 80-90% of the speed of the graphics, the main
restriction being that the cpu only supports PCIe 2.0 – as everything else while not
ideal and scale-able for that GPU can probably feed it fast enough.

I have 2 questions (if you have time – sorry for the long comment but I wanted to make
my situation clear):

1. Do you see anything drastically wrong with this approach? No guarantees
obviously; I could spend more money now if I am just shooting myself in the foot,
but I would rather save it for the next system once I am fully committed and have
more experience.

2. I chose the GPU based on RAM, number of CUDA cores and Nvidia's compute
capability rating (which reminds me of the Windows performance rating – a bit
vague but better than nothing). The other one I was considering was this, £13 more,
so also a fine price imho:

https://www.overclockers.co.uk/palit-geforce-gtx-1050ti-stormx-4096mb-pci-
express-gddr5-graphics-card-gx-03t-pl.html

It has fewer cores (768 vs 1024) but a smaller process node, a higher clock speed
(1290MHz vs 1178MHz), and I *think* a higher rating, assuming that the Ti is just better
(seems to mean unlocked): 6.1 vs 5.2:

https://developer.nvidia.com/cuda-gpus#collapse4

Basically, is the drop in cores really made up for to such a drastic extent that this
significantly higher rating from Nvidia is accurate? Noting that I am probably going
to be happy enough either way – feel free to just say “either is probably fine”.


Alternatively, is there something else in the sub-£150-ish range that you would
suggest, given that the whole thing may be replaced by a Titan X or similar
(hopefully cheaper after Christmas) if this goes well? I did consider just getting
something like this: much less RAM, but still more cores than 2, and it allows me to
figure out how to get code running on the GPU:

https://www.overclockers.co.uk/asus-geforce-gt-710-silent-1024mb-gddr3-pci-
express-graphics-card-gx-396-as.html
Reply

Gordon says
2016-12-02 at 10:49

Got the 1050 Ti (well, another variation of it); I figured they would be similar
regardless, so I might as well trust Nvidia's rating.

https://www.amazon.co.uk/gp/product/B01M66IJ55/

Also got 32 GB of RAM and a quad-core i5 that supports PCIe 3.0, as they were all
cheap on eBay. (SSD too, of course.)

Looks like I can mount my ZFS pool in Ubuntu, so I will probably just take freenas
offline for a while and use this as a file and Plex server too (very few users
anyway), and this way my RAID array will be local should I want to use it.

Reply

Tim Dettmers says


2016-12-02 at 20:18

That sounds solid. With that you should easily get started with deep
learning. The setup sounds good if you want to try out some deep learning
on Kaggle.com for example.

Reply

Tim Dettmers says


2016-12-02 at 20:16


Upgrading the system bit by bit may make sense. Note that the CPU and RAM will
make no difference to deep learning performance, but might be interesting for
other applications. If you only use one GPU, PCIe 2.0 will be fine and will not
hurt performance. The GTX 960 and GTX 1050 Ti are on a par in terms of
performance, so pick whichever is more convenient / cheaper for you.

Reply

Mor says
2016-11-28 at 19:16

Hi Tim,
I am willing to buy full hardware for deep learning;
my budget is about $15,000.
I don't have any experience in this, and when I tried to check things out it was too
complicated for me to understand.
Can you help me? Maybe recommend companies or anything else that suits
my budget and is still good enough to work with?
Thanks a lot

Reply

Tim Dettmers says


2016-11-29 at 16:03

If I were you I would put together a PC on pcpartpicker.com with 4 GPUs and
then build it myself. This is the cheapest option. If that is too
difficult, then I would look for companies that sell deep learning desktops. They
basically sell the same hardware, but at a higher price.

Reply

JP Colomer says
2016-11-24 at 22:47

Hi Tim,
Thank you for this excellent guide.


I was wondering, now that the new 1000 series and Titan X came out, what are your
updated suggestions for GPUs (no money, best performance, etc)?
Reply

Tim Dettmers says


2016-11-29 at 15:57

Please, see my GPU blog post for these updates.

Reply

JP Colomer says
2016-12-05 at 07:19

Thank you, Tim. I ended up buying a GTX 1070.


Now, I have to purchase the MOBO. I’m deciding between a GIGABYTE
GA-X99P-SLI and a Supermicro C7X99-OCE-F.
Both support 4 GPUs but it seems that there is not enough space for a 4th
GPU on the Supermicro. Any experience with these MOBOs?
This is my draft https://pcpartpicker.com/list/6tq8bj

Reply

Tim Dettmers says


2016-12-13 at 12:18

Indeed, the Supermicro motherboard will not be able to hold a 4th
GPU. I also have a Gigabyte motherboard (although a different one)
and it worked well with 4 GPUs (while I had problems with an ASUS
one), but I think in general most motherboards will work just fine. So it
seems like a good choice.

Reply

Shahid says

2016-11-10 at 10:51
I am confused between two options:
1) A 2nd Generation core i5, 8GB DDR3 RAM and a GTX 960 for $350.
2) A 6th Generation core i3, 16GB DDR3 RAM and a GTX 750Ti for $480.

Can you please comment? I expect to upgrade my GPU after a few months.

Reply

Tim Dettmers says


2016-11-10 at 23:28

A difficult choice. If you upgrade your GPU in a few months, then it depends on whether
you use your desktop only for deep learning or also for other tasks. If you use
your machine regularly, I would spend the extra money and go for option (2). If
you want to use the machine almost exclusively for deep learning, (1) is a good,
cheap choice. The choice also depends on whether you buy the 2GB or 4GB variant
of each GPU. In terms of speed, (1) will be about 33-50% faster, but the speed
would not be too important when you start out with deep learning, especially if
you upgrade the GPU eventually.

Reply

Shahid says
2016-11-11 at 06:56

Thank you Tim, you really inspire me! Actually I took the Udacity SDCND
course, and here is the list of a few projects I want to accomplish on a local
machine:

1. Road Lane-Finding Using Cameras (OpenCV)


2. Traffic Sign Classification (Deep Learning)
3. Behavioral Cloning
4. Advanced Lane-Finding (OpenCV)
5. Vehicle Tracking Project (Machine Learning and Vision)

So, my work is solely related to Computer Vision and Deep Learning. I also
have the option of a GTX 1060 6GB with that Core i3 (2). Of course, I expect
to code the GPU versions of the OpenCV tasks. Do you think this 3rd option
would be sufficient to accomplish these projects in an average amount of
time? Thank you again.

Reply


Gautam Sharma says


2016-11-21 at 16:52

Hi Shahid. I'm in the same boat as you. I have also signed up for the
SDCND. I have an old PC with a Core i3 and 2GB RAM. I am adding an
additional 8GB of RAM and buying a GTX 1060 6GB. This is a really
powerful GPU which will perform great in the work associated with the
SDCND.

Reply

meskie sprawy says


2016-11-08 at 20:11

As a website possessor I believe the content material here is really
excellent; appreciate it for your hard work. You should keep it up forever! Best of
luck.

Reply

Tim Dettmers says


2016-11-10 at 23:05

Thank you! I aim to keep it up forever

Reply

Hrishikesh Waikar says


2016-11-04 at 07:58

Hi Tim,
Wonderful article. However, I am about to buy a new laptop. So what do you feel
about the idea of a gaming laptop for deep learning with an Nvidia GTX 980M or GTX
1060/1070?
Reply

Tim Dettmers says


2016-11-07 at 11:33

Definitely go for the GTX 10 series GPUs for your laptop since these are very
similar to full desktop GPUs. They are probably more expensive though.
Another option would be to buy a cheap, light laptop with long-battery
duration and a separate desktop to which your connect remotely to run your
deep learning work. The last option is what I use and I am quite fond of it.

Reply

Alisher says
2016-11-08 at 06:42

I am very happy that I thought as you did. I bought a MacBook Air, which is
very portable, and I am going to buy a desktop with better specifications to do
my experiments on.

I had a question but I have asked it in previous comment.

Thank you again for the very useful information.


Regards,

Reply

Tim Dettmers says


2016-11-10 at 23:04

You are welcome! I am glad I could help out!


Tim

Reply


panovr says
2016-11-01 at 02:53

Great article, and thanks for sharing!


I want to configure my working layout like yours: “Typical monitor layout when I do
deep learning: Left: Papers, Google searches, gmail, stackoverflow; middle: Code;
right: Output windows, R, folders, systems monitors, GPU monitors, to-do list, and
other small applications.”
Do I need extra configuration in addition to connect 3 monitors to the
motherboard? Is there any additional hardware need for this 3 monitors
configuration?
Thanks!

Reply

Tim Dettmers says


2016-11-07 at 11:27

No extra configuration is required other than the normal monitor configuration


for your operating system. Your GPU needs to have enough connectors and
support 3 monitors (most modern GPUs do).

Reply

Poornachandra Sandur says


2016-10-30 at 09:47

Hi Tim,
Thank you for sharing your knowledge; it was very beneficial for understanding
the concepts in DL.
I have a doubt:
how do I feed custom images into a CNN for object recognition using the Python
language? Please give some pointers on this.

Reply

Tim Dettmers says


2016-11-07 at 11:25

You will need to rescale custom images to a specific size so that you can feed
your data into a CNN. I recommend looking at ImageNet examples of common
libraries (Torch7, Tensorflow) to understand the data loading process. You will
then need to write an extension which resizes your images to the proper
dimension, for example 1080×1920 -> 224×224.
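
As an illustration of that resizing step, here is a minimal sketch using Pillow and NumPy (the file path and target size are hypothetical):

```python
from PIL import Image
import numpy as np

def load_and_resize(path, size=(224, 224)):
    """Load an image from disk and resize it to the CNN's input size."""
    img = Image.open(path).convert('RGB')   # drop alpha channel if present
    img = img.resize(size, Image.BILINEAR)  # e.g. 1920x1080 -> 224x224
    return np.asarray(img, dtype=np.float32) / 255.0  # scale to [0, 1]

# batch = np.stack([load_and_resize(p) for p in list_of_paths])
```
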

Reply

Alisher says
2016-11-08 at 06:36

Firstly, I am very thankful for your post. It is very nice and very helpful.

One thing I wanted to point out is that you can feed the images into the network (in
Caffe) as they are. I mean if you have a 1080×1920 image, there is no need to
reshape it to 224×224. But this does not mean that feeding the image as-is
performs better; I think this could be a standalone research topic.

Secondly, I am planning to buy a desktop PC, and since I am a deep
learning researcher (beginner) I am going to do a lot of experiments on
ImageNet and other large-scale datasets. Do you suggest buying a gaming PC
directly, or would it be a wise choice to build my own PC?
I was considering buying an Asus ROG G20CB P1070.

Thank you very much in advance!

Regards,

Reply

Tim Dettmers says


2016-11-10 at 23:03

Building your own PC would be a better choice in the long term. It can
be daunting at first, but it is often easier than assembling IKEA furniture,
and unlike IKEA furniture there are a multitude of resources on how to
do it step by step. After you have built your first desktop, building
the next desktops will be easy and rewarding, and you will save a lot of
money to boot!

Reply

Shahid says

2016-11-12 at 06:45
Thank you Tim!

Prasanna Dixit J says


2016-10-28 at 06:38

This is a good overview of the HW that matters for DL. I would like your view on
the OpenPOWER-NVIDIA combo, and the economics of setting up an ML/DL lab.

Reply

Tim Dettmers says


2016-11-07 at 11:21

I think that non-consumer hardware is not so economically efficient for setting
up a ML/DL lab. However, beyond a certain threshold of GPUs, traditional
consumer hardware is no longer an option (NVIDIA will not sell you consumer-
grade GPUs in bulk and there might also be problems with reliability). I would
recommend getting as much traditional, cheap, consumer hardware as possible
and mixing it with some HPC components like cheap Mellanox Infiniband cards
and switches from eBay.

Reply

Arthur says
2016-10-24 at 22:24

Great hardware guide. Thank you for sharing your knowledge.

Reply


Shashwat Gupta says


2016-10-21 at 14:26

Hey, I wanted to ask if the nvidia quadro k4000 will be a good choice for running
convolutional nets?

Reply

Tim Dettmers says


2016-10-24 at 14:08

A K4000 will work, but it will be slow and you cannot run big models on large
datasets such as ImageNet.

Reply

Shashwat Gupta says


2016-10-24 at 14:26

Shall I get a GTX1080 instead?

Reply

Ashiq says
2016-10-19 at 19:06

Hi Tim

Thanks for the great article and your patience to answer all the questions. I just built
a dev box with 4 Titan X Pascal and need some advice on air flow. For reference,
here is the Part list: https://pcpartpicker.com/list/W2PzvV and the
Picture: http://imgur.com/bGoGVXu

Loaded Windows first for stress testing the components and noticed the GPUs
temps reached 84C while the fans are still at 50%. Then the GPUs started slowing
down to lower/maintain the temp. Then with MSI Afterburner, I could specify a
custom temp-vs-fanspeed profile and keep the GPU temps at 77C or below –
pretty much what you wrote in the cooling section above.

There is no “Afterburner” for Linux, and apparently the BIOS of the Titan X Pascal is
locked so we can’t flash them with custom temp setting. The only option left for me
is to play with the coolbits and I prefer not to attach 4 monitors to it (I already have
two 30inch monitors that are attached to a windows computer that I use for
everything. 6 monitors on the table will be too much).

I wonder if you have found any new way of emulating monitors for Xorg, as my preferred
option would be to keep 3 of the GPUs headless?

Cheers
Ashiq
Reply

Tim Dettmers says


2016-10-24 at 14:16

I did not succeed in emulating monitors myself. Some others claim that they
got it working. I think the easiest way to increase the fan speed would be to
flash the GPU with a custom BIOS. That way it will work in both Windows and
Linux.

Reply

spuddler says
2016-10-26 at 15:34

Not sure, but maybe there exist specific dummy plugs to help
“emulate” monitors, if it's not possible purely by software. At least DVI and
HDMI dummy plugs worked for cryptocurrency miners back in the day.

Reply

Ashiq says
2016-10-27 at 03:41

So I got it (virtual screens with coolbits) working by following the clues
from http://goo.gl/FvkGC7. Here (https://goo.gl/kE3Bcs) is my Xserver config
file (/etc/X11/xorg.conf) and I can change all 4 fan speeds with nvidia-
settings.
Reply

Tim Dettmers says


2016-10-27 at 11:20

Thanks Ashiq — that sounds great! Thank you for sharing the link!

Reply

Piotr Czapla says


2017-11-07 at 21:48

Hi Ashiq,
Would you mind sharing how loud is your setup. It look very similar to the one
I’m planning to build and I’m torn between going for liquid cooling or air
cooling. Will I be able hear it from 10 meters away?

Regards,
Piotr

Reply

anon says
2016-10-19 at 18:07

Hi Tim,

Could you recommend any Mellanox ConnectX2 cards for GPU-RDMA ?


Some are just Ethernet (MNPA19 XTR, for e.g. ) and I wonder if those can be flashed
to support RDMA or maybe I should just buy a card which supports Infiniband
outright ?


Reply

anon says
2016-10-02 at 01:09

Hi Tim,

I just got 5 Dell Precision T7500s in an auction.

I haven't received them yet, but the description mentions an Nvidia Quadro 5000
installed.
Would it be worth replacing them or are they enough for starting out?
The machines themselves have 12GB of DDR3 (ECC I presume) RAM and a Xeon 5606 as
described.

Reply

Tim Dettmers says


2016-10-03 at 15:04

The Quadro 5000 has only a compute capability of 2.0 and thus will not work
with most deep learning libraries that use cuDNN. Thus it might be better to
upgrade.

Reply

anon says
2016-10-04 at 19:59

Thanks.

I am thinking of going with a GTX 1060.

Is there any difference, though, between the EVGA, ASUS, MSI or NVIDIA
versions?

These are the options I see when I search on eBay.

Reply

Gautam Sharma says



2016-11-21 at 16:40
That shouldn't matter much. Don't go with the Nvidia Founders Edition; it
doesn't have a good cooling system. Just go with the cheapest one,
which is EVGA. It is one of the most promising brands. I just ordered the
EVGA one.

Reply

Tim Dettmers says


2016-11-21 at 20:02

Please note that the GTX 1080 EVGA currently has cooling
problems which are only fixed by flashing the BIOS of the GPU.
This card may begin to burn without this BIOS update.

Toqi Tahamid says


2016-09-25 at 13:40

My current CPU is an Intel Core i3 2100 @ 3.1GHz and I have 4GB of RAM. My motherboard is
a Gigabyte GA-H61M-S2P-B3 (rev. 1.0). It supports PCIe 2.0. Can I use a GTX 1060 in
my current configuration or do I need to change the board and the CPU? I want to
keep the cost as low as possible.

Reply

Tim Dettmers says


2016-09-26 at 12:54

You should be able to run a GTX 1060 just fine. The performance should be
only 5-10% less than on an optimal system.


Reply

zac zhang says


2016-09-22 at 11:40

Awesome! Thanks for sharing. Can you tell me how much it will cost to
build up such a cluster? Cheers!

Reply

Tim Dettmers says


2016-09-23 at 12:45

Basically it is two regular deep learning systems together with InfiniBand cards.
You can get an InfiniBand card and a cable quite cheaply on eBay, and the total cost
for a 6 GPU, 2 node system would be about $3k for the system and InfiniBand
cards, and an additional $6k for the GPUs (if you use the Pascal GTX Titan X), for a
total of $9k.

Reply

Shravankumar says
2016-09-21 at 10:45

I am using an Asus K55VJ: 3rd gen i5, Nvidia GeForce GT 635M 2GB, with a 750 HDD
and 8GB RAM. Does my computer support deep learning?

Reply

Tim Dettmers says


2016-09-23 at 12:43

Your GPU has compute capability of 2.1 and you need at least 3.0 for most
libraries — so no, your computer does not support deep learning on GPUs.
You could still run deep learning code on the CPU, but it would be quite slow.
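
For readers unsure what their own card supports, here is a small sketch, assuming PyTorch is installed, that reads the compute capability programmatically; you can also simply look the number up in NVIDIA's CUDA GPU table:

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)  # e.g. (6, 1) for a GTX 1060
    print("Compute capability: {}.{}".format(major, minor))
    if (major, minor) < (3, 0):
        print("Too old for cuDNN-based libraries; expect CPU-only training.")
else:
    print("No CUDA-capable GPU detected.")
```
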

Reply

Jacqueline says
2016-09-16 at 07:04

Hi Tim

Is this one a good one for a Deep Learning Researchers?

https://www.bhphotovideo.com/c/product/1269213-
REG/asus_g20cb_db71_gtx1070_republic_of_gamers_g20cb.html

thank you!

Reply

Tim Dettmers says


2016-09-17 at 16:33

It is a bit pricey and there are not many details about the motherboard. Also
the GPU might be a bit weak for researchers.

I would also encourage you to buy components and build them together on
your own. This may seem like a daunting task but it is much easier than it
seems. This way you get a high-quality machine that is cheap at the same time.

Reply

Gilberto says
2016-09-08 at 09:39

Hi Tim,
first of all thank you for sharing all this precious information.

I am new to neural networks and Python.


I want to test some ideas on financial time series.
I’m starting to learn python, theano, keras.

After reading your article, I decided to upgrade my old pc.


I know almost nothing about hardware so I ask you an opinion about it.

Current configuration:
– Motherboard: Gigabyte GA-P55A-UD3 (specification
at: http://www.gigabyte.com/products/product-page.aspx?pid=3439#sp)
– Intel i5 2.93 GHz
– 8 Gb Ram
– GTX 980
– PSU power: 550watts

I may add:
– Ssd Hard Drive (I will install Ubuntu and use it only by command line – not
graphical interface)

Is the power supply powerful enough for the new card?

Does the motherboard support the new card?

Thank you very much,


Gilberto
Reply

Tim Dettmers says


2016-09-10 at 03:52

The motherboard should work, but it will be a bit slower. The PSU is borderline;
it might have a bit too few watts or be just right, it's hard to tell.

Reply

Arman says
2016-08-26 at 17:49

Hi Tim,

I had a question about the new pascal gpu’s. I am debating between Gtx 1080 and
Titan X. The price of Titan X is almost double the 1080’s. Excluding the fact that
Titan X has 4 more Gb memory, does it provide significant speed improvement
over 1080 to justify the price difference?

Thanks,


Reply

Juan says
2016-09-05 at 00:24

Hi,

I am not Tim (obviously), but as far as I understood from his other post on GPUs
(http://timdettmers.com/2014/08/14/which-gpu-for-deep-learning/), he states
that for research-level work it actually makes a difference, especially when you
are using video datasets. But for example … “While 12GB of memory are essential for
state-of-the-art results on ImageNet on a similar dataset with 112x112x3
dimensions we might get state-of-the-art results with just 4-6GB of memory.”

Hope this can help you.

Reply

DarkIdeals says
2016-09-10 at 05:55

If you can afford it the TITAN X is DEFINITELY worth it over the 1080 in most
cases. Not only does it have that 12GB of VRAM to work with but it also has
features like INT8 (the way i understand it, is that you can store floats as 8 bit
integers which helps efficiency etc.. Potentially quite useful) and has 44 TOP
units (kinda like ROPs but not for graphic rendering, they are beneficial to Deep
Learning though)

Basically the TITAN X is literally identical to the $7000 Tesla P100 just without the
Double Precision FP64 capability and without HBM2 memory (The TITAN X
uses GDDR5X instead, however it’s not much of a difference as the P100’s
memory bandwidth even with the HBM is only 540 GB/second whereas the
TITAN X is very close at 480 GB/second and hits 530 GB/second when you
overclock the memory from 10,000mhz to 11,000mhz so it’s literally no
difference really) Other than those things and the certified Tesla Drivers there’s
literally no real difference between the P100 and the TITAN X Pascal; which is
very important as the Tesla P100 is literally THE most powerful graphic card on
the planet right now!

The important thing to mention is that Double Precision isn’t really important
for Neural nets etc.. that you deal with in Deep Learning; so for $1,200 you are
getting the power of the $7,000 monster supercomputer chip of the Tesla P100
just without all the unnecessary server features that Deep Learning doesn’t use.


Also, in comparison to the GTX 1080, the TITAN X has a significant advantage in
both memory capacity (12GB vs 8GB on 1080), memory bandwidth (530 GB/s
when overclocked on the TITAN X, vs 350 GB/s on the 1080 when overclocked…
that’s a FIFTY PERCENT increase in memory bandwidth!), and has a massive
increase in CUDA cores which is very beneficial (40% more, which when
combined with the double memory capacity and 50% higher bandwidth easily
nets you ~60% more performance in some scenarios over the 1080)

Hope this helps, the TITAN X is a GREAT chip for Deep Learning, the best in the
world currently available in my opinion. Which is why i bought two of them.
Reply

DarkIdeals says
2016-11-09 at 01:32

(sorry for the long post but it is important to your decision so try to read it all if
you have time)

Hey, correcting an error in my earlier post. Like I said, I wasn't quite sure if I
understood the INT8 functionality properly, and I was wrong about it.
Apparently there was a typo in the spec pages of the Pascal TITAN X, it said “44
TOPs” and made me think it was an operation pipeline of sorts similar to a
“ROP” which is responsible for displaying graphical images etc..

It actually was referring to INT8, which is basically just 8 bit integer
support. The average GPU runs with 32 bit “full precision” accuracy, which is a
measurement of how much time and effort is put into each “calculation” made
by the GPU. For example, with 32 bit it may only go out to 4 decimal points
when calculating for the physics of water in a 3d render etc.. which is plenty
good for things like Video Games and your average video editing and
rendering project; but for things like advanced physics calculations by big
universities that are trying to determine the 100% accurate behavior of each
individual molecule of H2O within the body of water to see EXACTLY how it
moves when wind blows etc.. you would need “double precision” which is a 64
bit calculation that would have much more accuracy, going to more decimal
points before deciding that the calculation is “close enough” compared to what
32 bit would.

Only special cards like Quadro’s and Tesla’s have high 64 performance, they
usually have half the Teraflops of performance at 64 bit mode compared to 32
bit, so a Quadro P6000 (same GPU as the TITAN XP but with full 64 bit support)
it has 12 Teraflops of power at 32 bit mode and ~6 Teraflops of power in 64 bit
mode. But there is also 16 bit mode, “half precision” for things requiring even
less accuracy, INT8 to my understanding is basically “8 bit quarter precision”
mode, with even less focus on total mathematical accuracy; and this is useful
for Deep Learning as some of the work done doesn't require that much
accuracy.

So, in other words, in 8 bit mode, the TITAN X has “44 Teraflops” of
performance.
Reply

Tim Dettmers says


2016-11-10 at 23:09

Your analysis is very much correct. However, for some games there are
already some elements which make heavy use of 8-bit Integers. However,
before it was not possible to do 8-bit Integer computation, but you had to
first convert both numbers to 32-bit, then do the computation, and then
convert it back. This would be done implicitly by the GPU so that no
programming was necessary. Now the GPU is able to do it on its own.
However, the support is still quite limited so you will not the 8-bit deep
learning just yet. Probably in a year earliest would be my guess, but I am
sure it will arrive at some point.
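
To make the 8-bit storage idea concrete, here is a tiny NumPy sketch of simple linear quantization; this is only an illustration of the concept, not how any particular framework or GPU implements it:

```python
import numpy as np

w = np.random.randn(4, 4).astype(np.float32)    # original 32-bit weights

scale = np.abs(w).max() / 127.0                 # map the float range onto [-127, 127]
w_int8 = np.round(w / scale).astype(np.int8)    # 4x smaller to store or transfer

w_restored = w_int8.astype(np.float32) * scale  # dequantize before computing
print("max quantization error:", np.abs(w - w_restored).max())
```
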

Reply

Ionut Farcas says


2016-08-17 at 14:09

First of all, really nice blog and well made articles.


Do you think that spending £240 more for a 1070 (2048 CUDA cores) instead of a
1060 (1280 CUDA cores) for a laptop is worth it? Does the complexity of the most used deep
learning algorithms require the extra 768 CUDA cores?
Thank you.

Reply

Tim Dettmers says


2016-08-18 at 03:40

I am not sure how easy it is to upgrade the GPU in the laptop. If it is difficult,
this might be one reason to go with the better GPU since you will probably also
have it for many years. If it is easy to change, then there is not really a
right/wrong choice. It all comes down to preference, what you want to do and
how much money you have for your hardware and for your future hardware.
Reply

sk06 says
2016-08-17 at 12:54

Hi,
I just bought two Supermicro 7048GR-TR server machines with 4 Titan X cards in
each machine. I'm confused about how to configure the servers: how many partitions I
have to make, and how to utilize the 256GB SSD drive and the two other 4TB hard drives in
each machine. The servers will only be used for deep learning applications. Which
deep learning framework should I use (TensorFlow, Caffe or Torch) considering the
two servers? I work in the medical imaging domain and recently started getting
into deep learning. Please help me with your valuable suggestions.

Link for server configuration:


https://www.supermicro.com.tw/products/system/4u/7048/SYS-7048GR-TR.cfm

Thanks and Regards


sk06

Reply

Tim Dettmers says


2016-08-18 at 03:37

The servers have a slow interconnect, that is the servers only have a gigabit
Ethernet which is a bit too slow for parallelism. So you can focus on setting up
each server separately. It depends on your dataset size, but you might want to
have the SSD drive dedicated for your datasets, that is, install the OS on the
hard drive. If your datasets are < 200GB, you could also install the OS on the
SSD to have a smoother user experience. The frameworks all have their pros
and cons. In general I would recommend TensorFlow, since it has the fastest
growing community.

Reply

sk06 says
2016-08-24 at 05:10

Thanks for the suggestions. I tried training my application with 4 GPUs in
the new server. To my shock, training AlexNet took 2.30 hrs with 4 GPUs,
while training AlexNet took 35 mins with a single GPU. I used Caffe for this.
Please let me know where I am going wrong! The batch size and other
parameter settings are the same as in the original paper.

Thanks and Regards


sk06

Reply

chanhyuk jung says


2016-08-16 at 09:02

I just started learning about neural networks and I'm looking forward to studying them.
I have a GT 620 with a dual-core Pentium G2020 clocked at 3.3 GHz and 8GB of RAM.
Would it be better to buy a 1060 and two 8GB RAM sticks for the future?

Reply

Tim Dettmers says


2016-08-17 at 06:15

Yes, the GT 620 does not support cuDNN, which is important deep learning
software and makes deep learning just more convenient, because it allows you
more freedom in choosing your deep learning framework. You will have fewer
troubles if you buy a GTX 1060. 16GB of RAM will be more than enough; I think
even 8GB could be okay. Your CPU will be sufficient, no update required.

Reply

Vasanth says
2016-08-11 at 17:22

Hi Tim,


Many thanks for this post, and your patient responses. I had a question to ask –
NVIDIA gave away Tesla K40C (which is the workstation version of K40, as I
understand) as part of its Hardware Grant Program (I think they are giving TitanX
now, but they were giving Tesla K40Cs until recently). It’s not clear to me what
workstations from standard OEMs like Dell/HP are compatible with a K40C. I have
spoken to a few vendors about compatibility issues, but I don’t seem to get
convincing responses with knowledge. I am concerned about buying a workstation,
which would later not be compatible with my GPU. Would it be possible for you to
share any pointers you may have?

Thank you very much in advance.


Reply

Tim Dettmers says


2016-08-13 at 21:51

The K40C should be compatible with any standard motherboard just fine. The
compatibility that hardware vendors stress is often assumed for data centers where
the cards run hot and need to do so permanently for many months or years.
The K40 has a standard PCIe connector and that is all that you need for your
server motherboard.

Reply

Wajahat says
2016-08-11 at 15:27

Hi Tim

Thanks a lot for your useful blog.

I am training CNN on CPU and GPU as well.


Although the weights are randomly initialized, I am setting the random seed to
zero at the beginning of training. Still, I am getting different weights learnt on the
CPU than on the GPU. The difference is not huge (e.g. -0.0009 and -0.0059, or 0.0016 and
0.0017), but there is a difference that I can notice. Do you have any idea how this
could be happening? I know it is a very broad question, but what I want to ask is:
is this expected or not?

I am using MatlabR2016a with MatConvNet 1.0 beta20 (Nvidia Quadro 410 GPU in
Win7 and GTX1080 in Ubuntu 16.04), Corei7 4770 and Corei7 4790.
Exactly same data with same network architecture used.


Best Regards
Wajahat
Reply

Tim Dettmers says


2016-08-13 at 21:45

This can well be true and normal. The seed itself can produce different random
numbers on CPU and GPU if different algorithms are used. Convolution on
GPUs may also include some non-deterministic operations (cuDNN 4). When
using unit tests to compare CPU and GPU computation, I also often have some
difference in output given the same input, thus I assume that there are also
small differences in floating point computation (although very small). All this
might add up to your result.
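
A tiny NumPy sketch of one source of such differences: float32 addition is not associative, so a different summation order (and CPU and GPU kernels do differ here) can shift the result slightly even with identical inputs:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(100000).astype(np.float32)

s_forward = x.sum()          # one summation order
s_reverse = x[::-1].sum()    # same numbers, different order
print(s_forward - s_reverse) # typically a small non-zero difference
```
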

Reply

Arman says
2016-08-07 at 22:02

Thanks for the great guide.


I had a question. What is the minimum build that you recommend for hosting a
Titan X pascal?

Reply

Tim Dettmers says


2016-08-08 at 06:48

For a single Titan X Pascal and if you do not want to add another card later
almost any build will do. The CPU does not matter; you can buy the cheapest
RAM and should have at least 16 GB of it (24 GB will be more than enough). For
the PSU 600 watts will do; 500 watts might be sufficient. I would buy a SSD if
you want to train on large data sets or raw images that are read from disk.

Reply


anon says
2016-08-05 at 20:52

How are you so patient with everyone’s questions ?

Reply

Tim Dettmers says


2016-08-06 at 18:09

There are several reasons:


– I led a team of 250 in an online community and people often asked me for
help and guidance. At first I sometimes lent support and sometimes I did not.
However, over time I realized that not helping out can produce problems:
it demotivates people from something which they really want to do but do not
know how to do, and it produces defects in the social environment (when I do not
help out, others would take example from my actions and do the same), among
others. Once I started always lending a hand, I found that I did not lose as much
time as I thought I would. Due to my vast background knowledge in this
online community, it often was faster to help than to think about whether some
question or request was worthy of my help. I now always help without a second
thought, or at least start helping until my patience grows tired
– Helping people makes me feel good
– I was born with genes which make me smart and which make me understand
some things easier than others. I feel that I have a duty to give back to those
which were less fortunate in the birth lottery
– I believe everybody deserves respect. Answering questions which are easy for
me to answer is a form of respect

I hope that answers your question

Reply

Andrew says
2016-08-15 at 06:59

You are an amazingly good person Tim. The world needs more people like
you. Your actions encourage others to behave in a similar way which in
turn helps build better online and offline communities. Thank you!

Reply


Tim Dettmers says


2016-08-16 at 05:23

Thank you for the kind words!

Reply

Michael Lanier says


2016-08-03 at 16:33

How do the new NVIDIA 10xx compare? I followed through with this guide and
ended up getting a GTX Titan. The bandwidth looks slightly higher for the Titan
series. Does the architecture affect learning speeds?

Reply

Tim Dettmers says


2016-08-04 at 06:51

The bandwidth is high for all Titans, but their performance differs from
architecture to architecture; for example Kepler (GTX Titan) is much slower than
Maxwell (GTX Titan X) even though they have comparable bandwidth. So yes,
the architecture does affect learning speed, quite significantly so!

Reply

drh1 says
2016-07-31 at 04:50

hi tim,


thanks for some really useful comments. I have a hardware question. I've configured
a Windows 10 machine for some GPU computing (not DL) at the moment. I think
the hardware issues overlap with your blog, so here goes:
the system has a GTX 980 Ti card and a K40 card on an ASUS X-99 Deluxe
motherboard. When the system boots up, the 980 (which runs the display as well) is
fine, but the K40 gives me “This device cannot start. (Code 10). Insufficient system
resources exist to complete the API”. I have the most up-to-date drivers (354.92 for
K40, 368.81 for 980).

Has anyone configured a system like this, and did they have similar problems? Any
ideas will be greatly appreciated.
Reply

Tim Dettmers says


2016-08-04 at 06:45

It might well be that your GPU driver is meddling here. There are separate
drivers for Tesla and GTX GPUs and you have the GTX variant installed, thus
the Tesla card might not work properly. I am not entirely sure how to get around this
problem. You might want to configure the system as a headless (no monitor)
server with Tesla drivers and connect to it using a laptop (you can use remote
desktop on Windows, but I would recommend installing Ubuntu).

Reply

bmahak says
2016-07-26 at 01:18

I want to build my own deep learning machine using a Skylake motherboard and CPU.
I am planning not to use more than 2 GPUs (GTX 1080), starting with one GPU first
and upgrading to a second one if needed.
Here is my setup on
pcpartpicker: http://pcpartpicker.com/user/bmahak2005/saved/Yn9qqs

Please tell me what you think about it.

Thanks again for a great article .

HB.

Reply


Tim Dettmers says


2016-08-04 at 06:30

The motherboard and CPU combo that you chose only supports 8x/8x speed
for the PCIe slots. This means you might see some slowdown in parallel
performance if you use both of your GPUs at the same time. The decrease
might vary between networks with roughly 0-10% performance loss. Otherwise
the build seems to be okay. Personally I would go with a bit more watts on the
PSU just to have a safe buffer of extra watts.

Reply

David Selinger says


2016-07-24 at 19:17

Hey there Tim,

Thanks for all the info!


I was literally pushing send on an email that said “ORDER IT” to my local computer
build shop when nVidia announced the new Titan X Pascal.
Do you have any initial thoughts on the new architecture? Especially as it pertains
to cooling the NVRAM which usually requires some sort of custom hardware
(cooling plate? my terminology is likely wrong here) will that add additional delay
after purchasing the new hardware?

Thank you sir!

Reply

Tim Dettmers says


2016-07-25 at 06:02

There should be no problems with cooling for the GDDR5X memory with the
normal card layout and fans. I know for HBM2 NVIDIA actually designed the
memory to be actively cooled, but HBM2 is stacked while GDDR5X is not.
Generally GDDR5X is very similar to GDDR5 memory. It will consume less
power but also offer higher density, so that on the bottom line GDDR5X should
run on the same temperature level or only slightly hotter than GDDR5 memory
— no extra cooling required. Extra cooling makes sense if you want to
overclock the memory clockrate, but often you cannot get much more
performance out of it for how much you need to invest in cooling solutions.

Overall the architecture of Pascal seems quite solid. However, most features of
the series are a bit crippled due to manufacturing bottlenecks (16nm, GDDR5X,
HBM2 all these need their own factories). You can expect that the next line of
Pascal GPUs will step up the game by quite a bit. The GTX 11 series probably
will feature GDDR5X/HBM2 for all cards and allow full half-float precision
performance. So Pascal is good, but it will become much better next year.
Reply

David Selinger says


2016-07-25 at 22:23

Cool thanks. That gave me something to chew on.

Last question (hopefully for at least a week : ) ): Do you think that a


standard hybrid cooling closed-loop kit (like this one from
Arctic: https://www.arctic.ac/us_en/accelero-hybrid-iii-140.html) will be
sufficient for deep learning or is a custom loop the only way to go?
– VRM: heatsink + fan
– VRAM: Heatsink ONLY
– GPU: closed-loop water cooled

Obviously will have to confirm the physical fit once those specs become
more available, but insofar as the approach, I was a little bit concerned
about the VRAM.

The use case is convolutional networks for image and video recognition.

Thanks,
Selly

Reply

sk06 says
2016-07-09 at 10:28

Hi Tim,


Thanks for the excellent post. The user comments are also pretty informative. Kudos
to all.
I recently started shifting my focus from conventional machine learning to deep
learning. I work in the medical imaging domain and my application has a dataset of
50,000 color images (5,000 per class, 10 classes, size 512×512). I have a system with a
Quadro K620 GPU. I want to train state-of-the-art CNN model architectures like
GoogLeNet Inception V3, VGGNet-16 and AlexNet from scratch. Will the Quadro K620
be sufficient for training these models? If I have to go for higher-end GPUs, can you
please suggest which card I should go for (GTX 1080, Titan X, etc.)? I want to
generate the prototypes as fast as possible. Budget is not primary.
Reply

Tim Dettmers says


2016-07-09 at 17:01

A Quadro K620 will not be sufficient for these tasks. Even with very small batch
sizes you will hit the limits pretty quickly. I recommend getting a Titan X on
eBay. Medical imaging is a field with high-resolution images where any
additional amount of memory can make a good difference. Your dataset is
fairly small though and probably represents a quite difficult task; it might be
good to split up the images to get more samples and thus better results
(quarter them, for example, if the label information is still valid for these images),
which then in turn would consume more memory. A GTX Titan X should be
best for you.
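
If you go the quartering route, a minimal NumPy sketch of splitting a 512×512 image into four 256×256 patches (assuming the label stays valid per patch) could look like this:

```python
import numpy as np

def quarter(image):
    """Split a (512, 512, 3) image into four (256, 256, 3) patches."""
    h, w = image.shape[0] // 2, image.shape[1] // 2
    return [image[:h, :w], image[:h, w:], image[h:, :w], image[h:, w:]]

# images: array of shape (N, 512, 512, 3); labels: array of shape (N,)
# patches = np.stack([p for img in images for p in quarter(img)])
# patch_labels = np.repeat(labels, 4)   # each patch inherits its image's label
```
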

Reply

John says
2016-07-07 at 23:19

Great article. What would you recommend for a laptop GPU setup rather than a
desktop? I see a lot of laptop builds with a 980M or 970M GPU, but is it worth
waiting for some variant of the 1080M/1070M/1060M?

Reply

Tim Dettmers says


2016-07-09 at 16:55


A laptop with such a high-end graphics card is a huge investment and you will
probably use that laptop much longer than people use their desktops (it is
much easier to sell your GPU and upgrade with a desktop). I would thus
recommend waiting for the 1000M series. It seems it will arrive in a few months,
and the first performance figures show that they are slightly faster than
the GTX Titan X; that would be well worth the wait in my opinion!

Reply

Dante says
2016-07-07 at 20:21

Tim,

Based on your guide I gather that choosing a less expensive hexa-core Xeon CPU
with either 28 or 40 lanes will not see a great drop in performance. Is that correct
(for 1-2 GPUs)? Can you share your thoughts?

Great guides. very helpful for folks getting into Deep learning and trying to figure
out what works best for their budget.

Dante

Reply

Tim Dettmers says


2016-07-09 at 16:52

Yes that is very true. There is basically no advantage from newer CPUs in terms
of performance. The only reason really to buy a newer CPU is to have DDR4
support, which comes in handy sometimes for non-deep learning work.

Reply

Simon says
2016-07-07 at 12:34


In general I am seeking a cheaper way to assemble the setup without decreased performance.

Does NVIDIA coolbits make it possible to reduce how much the GPU heats up?
You wrote about “coolbits” on Ubuntu and the problem with headless setups.
Did you hear about DVI or VGA dummy plugs, i.e.
http://www.ebay.com/itm/Headless-server-DVI-D-EDID-1920×1200-Plug-Linux-
Windows-emulator-dummy-/201087766664
I think it would be a good solution for a video card with no monitor attached, with no
problems with coolbits control.

Reply

Simon says
2016-07-04 at 15:49

Hi,
the Asus X99-E WS spec shows that it has a PLX chip that provides an additional 48 PCIe
lanes. Getting an i7-6850K with an X99-E WS theoretically gives you 88 PCIe lanes in
total, and that is still plenty to run 4 GPUs all at x16.
Is that true for deep learning?
Thanks for the reply.

Reply

Tim Dettmers says


2016-07-04 at 19:14

I am not exactly sure how this feature maps to the CPU and to software
compatibility. From what I have heard so far, you can quite reliably access GPUs from
very non-standard hardware setups, but I am not so sure whether the software
would support such a feature. If the GPUs are not aware of each other on the
CUDA level due to the PLX chip, then this feature will do nothing good for
deep learning (it would probably be even slower than a normal board, because
you would probably need to go through the CPU to communicate between
GPUs).

But the idea of a PLX chip is quite interesting, so if you are able to find out
more information about software compatibility, then please leave a comment
here — that would not only help you and me, but also all these other people
that read this blog post!

Reply


Rikard Sandström says


2016-06-27 at 11:42

Thank you for an excellent post, I keep coming back here for reference.

With regards to memory types, what role does GDDR5 vs GDDR5X play? Is this an
important differentiator between offerings like 1080 and 1070, or is it not relevant
for deep learning?

Reply

gameeducationjournal.info says
2016-06-24 at 22:16

When I initially left a comment I seem to have clicked on the


-Notify me when new comments are added- checkbox and now every time a
comment is
added I get four emails with the exact same comment.
Is there a means you can remove me from that
service? Many thanks!

Reply

Tim Dettmers says


2016-06-25 at 21:32

That sounds awful. I will check what is going wrong there. However, I am unable
to remove a single user from the subscription. See if you can unsubscribe
yourself; otherwise please contact the Jetpack team. Apparently the data is
stored by them and the plugin that I use for this blog accesses that data, as you
can read here. I hope that will help you. Thanks for letting me know.

Reply

Arun Das says


2016-06-21 at 02:55

Wonderful guide ! Thank you !

Reply

Milan Ender says


2016-06-21 at 00:12

Hey,
first of all thanks for the guide, helped me immensely to get some clarity in this
puzzle!
Couple of questions as I’m a bit too impatient to wait for 1080/70 reviews on this
topic:
As you stated, bandwidth, memory clock and memory size seem to be among the
most important factors, so would it even make sense to put some more money into a
solidly overclocked custom GPU? So far I'll just pick the cheapest solidly cooled one
(EVGA ACX 3.0 probably).
Also, my initial analysis between the GTX 1070 and the GTX 1080 was heavily in favor of the
GTX 1080, based on the benchmarks from http://www.phoronix.com/scan.php?
page=article&item=nvidia-gtx-1070&num=4 . Though the theoretical TFLOPS SP
MIXBENCH results were closely in favor of the 1070 (76.6 €/TFLOP GTX 1080 vs 73.9
€/TFLOP GTX 1070), the SHOC on CUDA results in terms of price efficiency were
closely in favor of the GTX 1080, but more or less the same. However, the GDDR5X
on the GTX 1080 seems to seal the deal I guess for deep learning applications? Also I
found the 1080 around 6 Watt/TFlops more cost efficient. Am I on the right track
here? Maybe the numbers help some others here searching for opinions on that :).
Anyway, after reading through your articles and some others I came up with this
build:
http://pcpartpicker.com/list/LxJ6hq . Some comments would be very appreciated.
I feel like the CPU is a bit overkill, but it was the cheapest with DDR4 RAM and
40 lanes. Maybe not needed, though I'm a bit unsure of that.
Best regards

Reply

peter says
2016-06-19 at 23:15

Hello Tim:
Thanks for the great post. I built the following PC based on it.
CPU: i5 6600
Mother board: Z170-p
DDR4: 16g


GPU: nvidia 1080 founder edition


Power: 750W

However, after I installed 14.04, I can't get CUDA 8.0 and the new driver installed (which
they claim GTX 1080 users have to use).
Could the problem be caused by the other components of the PC, like the
motherboard?

Thanks!
Reply

Tim Dettmers says


2016-06-23 at 15:51

I have heard that people have problems with Skylake under Ubuntu 14.04, but I
am not sure if that is really the problem. You can try upgrading to Ubuntu 16.04,
because the Skylake support is better under that version, but I am not sure if
that will help.

Reply

Poornachandra Sandur says


2016-06-17 at 07:22

Hi Tim Dettmers,
your blog is awesome. I currently have a GeForce GTX 970 in my system; is that
sufficient for getting started with convolutional neural networks?

Reply

Tim Dettmers says


2016-06-18 at 22:37

A GTX 970 is an excellent option to explore deep learning. You will not be able
to train the very largest models, but that is also not something you want to do
when you explore. It is mostly learning how to train small networks on common
and easy problems, such as AlexNet and similar convolutional nets on MNIST,
CIFAR10 and other small data sets, until you get a “feel” for training
convolutional nets so that you can then go on to larger models and larger
data sets (ResNet on ImageNet for example). So everything is good.

Reply

Adrian Sarno says


2016-06-16 at 19:28

I haven't been able to boot up this MSI laptop with any of the flavors of 14.04
(Lubuntu, Xubuntu, Kubuntu, Ubuntu); could it be that the Skylake processor is
not compatible with 14.04?
https://bugzilla.kernel.org/show_bug.cgi?id=109081

Looks like I will have to wait until a fix is created for the upstream Ubuntu versions
or until NVIDIA updates CUDA to support 16.04. Is there anything else I can try?

Thanks!

Reply

Tim Dettmers says


2016-06-18 at 22:34

Laptops with an NVIDIA GPU in combination with Linux are always a pain to get
running properly, as it often also depends heavily on the other hardware in
your laptop. I do not have any experience in this case, but you might be able to
install 14.04 and then try to patch the kernel with what you need. Not easy to do
though.

Reply

David Laxer says


2016-06-16 at 17:47

Any comments on MIT’s Eyeriss chip?

Reply


David Laxer says


2016-06-16 at 17:49

http://www.rle.mit.edu/eems/wp-
content/uploads/2016/02/eyeriss_isscc_2016_slides.pdf

Reply

Glenn says
2016-06-16 at 00:30

Thanks for all the info. If I plan to use only one GPU for computation, then would I
expect to need two GPUs in my system: one for computation and another for
driving a couple of displays? Or can a single GPU be used for both jobs?

Reply

Tim Dettmers says


2016-06-16 at 16:25

A single GPU is fine for both. A monitor will use about 100-300MB of your GPU
memory and usually draw an insignificant amount (<2%) of performance. It is
also the easier option, so I would just recommend to use a single GPU.

Reply

Yasumi says
2016-06-15 at 13:26

For deep learning on speech recognition, what do you think of the following specs?
It’s going to cost 2928USD. What are your thoughts on this?
– INTEL CORE I7-6800K UNLOCKED FOR OC(28lanes)(6 CORE/ 12
THREADS/3.8GHZ) NEW!
– XSPC RayStorm D5 Photon AX240 (240L)
– ASUS X99-E WS (ATX/4way SLI/8x Sata3/2xGigabit LAN/10xUSB3.0)

– 4 x GSKILL RipjawsV RED 2x8GB DDR4 2400mhz (CL15)


-ZOTAC GTX1080 8GB DDR5X 256BIT Founder’s Edition (1733/10000)-NEW
– SuperFlower Leadex Gold 650W(80+Gold/Full Modular)*5 Years Warranty
– CORSAIR AIR 540 BLACK WINDOW
– INTEL 540s 480GB 2.5″ Sata SSD (560/480)
Reply

Tim Dettmers says


2016-06-16 at 16:29

This is a good build for a general computation machine. A bit expensive for
deep learning, as the performance is mostly determined by the GPU. Using
more GPUs and cheaper CPU/Motherboard/RAM would be better for deep
learning, but I guess you want to use the PC also for something different than
deep learning :). This would be a good PC for kaggle competitions. If you plan
on running very big models (like doing research) then I would recommend a
GTX Titan X for memory reasons.

Reply

Adrian Sarno says


2016-06-13 at 23:06

thanks so much for your advice! I managed to install Xubuntu 16.04, now the next
step is installing CUDA and TensorFlow, I will need all the advice that I can get with
that one.

The problem I have with Ubuntu Desktop is known, it looks like they are going to
address it in 14.04.1 (sorry for the comment slightly off topic).
http://askubuntu.com/questions/760051/ubuntu-16-04-0-final-unity-desktop-
kubuntu-gnome-can-not-boot-from-live-us/760124

Reply

Spuddler says
2016-06-14 at 15:50

You should try to use 14.04, 16.04 still can give you lots of headaches right now.

This is how I do it: http://pastebin.com/E6uFu2Em



This will not work on 16.04 for probably hundreds of reasons


Reply

Adrian Sarno says


2016-06-12 at 18:59

I have a laptop with an NVIDIA Quadro M3000M (4.0GB GDDR5, PCI-Express). I
would like to use it for deep learning. I noticed that no one mentions Quadro cards
in the context of deep learning; is there a design reason why these cards are not
used in deep learning?

PS: I tried to install Ubuntu (all its versions) and it fails to show the GNOME menu; it
just shows the background desktop image.

Reply

Spuddler says
2016-06-12 at 21:48

As far as I know, Quadro cards are usually optimized for CAD applications. You
can use them for deep learning, but they will not be as cost efficient as regular
GeForce cards.

Your problem with Ubuntu not booting is a strange one; it does not really look
like a graphics driver issue since you get a screen. Before googling for more
difficult troubleshooting procedures I would try other Ubuntu 14.04 LTS flavours
if I were you, like Xubuntu (Windows-like, lightweight), Kubuntu (Windows-like,
fancy) or even Lubuntu (very lightweight). It may just be some arcane issue with
Ubuntu's Gnome desktop and your hardware.

Reply

Nizam says
2016-06-10 at 11:42

This is the most informative blog about building a deep learning machine!
Thanks for that.


Now that Nvidia's 1080 and 1070 are launched, which is the better deal for us:
two 1070s or one 1080?

Everyone writes in the context of gamers.

I badly need this community's voice here!
Reply

Epenko pentekeningen says


2016-06-09 at 23:11

Question: For budgetary reasons I'm looking at an AMD CPU / board combination
(4 cores), but that combination has no onboard video.
Can the GPU (4GB Nvidia 960) which will be used for machine learning also be used
at the same time as the video card (no 3D, of course)?
Does that work, or do I need an extra video card? Thanks!

Reply

Tim Dettmers says


2016-06-11 at 16:17

Yes, that will work just fine! This would be a great setup to get started
with deep learning and get a feel for it.

Reply

Adrian Sarno says


2016-06-09 at 18:48

Tim,

I'm looking for information on which GPU cards have support for convolutional
layers. In particular I was considering a laptop with the GTX 970, but according to
your blog above it does not support convolutional nets. Would you mind explaining
what that means in terms of features and also time performance? Is there a
way to know from the spec whether a card is good for conv nets?

thanks in advance


Reply

Tim Dettmers says


2016-06-11 at 16:16

Maybe I have been a bit unclear in my post. The GTX 970 supports
convolutional nets just fine, but if you use more than 3.5GB of memory you will
be slowed down. If you use 16-bit networks, though, you can still train relatively
well-sized networks. So a GTX 970 is okay for most non-research, non-I-want-
to-get-into-top-5-Kaggle use-cases.

Reply

Greg says
2016-05-28 at 08:35

Hey Tim…quick question. Do you have any opinion about the new GeForce GTX
1080s for deep learning?

Maybe you already gave your opinion and I missed it.

Thanks,

Greg

Reply

Thomas R says
2016-05-19 at 15:24

Hi Tim, did you connect your 3 monitors to the mainboard/CPU or to your GPU?
Does this have an influence on the deep learning computation?

Reply

Tim Dettmers says



2016-05-26 at 11:08
I connected them to two GPUs. It does not really affect performance (maybe 1-
3% at most), but it does take up some memory (200-500MB). Overall this
effect is negligible.

Reply

DD Sharma says
2016-05-13 at 15:15

Hello Tim,

Comparing two cards for GPGPU (deep learning being an instance of GPGPU),
what is more important: the number of cores or memory? For learning purposes and maybe
some model development I am considering a low-end card (512 cores, 2GB). Will this
seriously cripple me? Other than giving up performance gains, will it seriously be
constraining? I checked research work of folks from 5+ years ago and many in
academia used processors with even weaker specs and still got something done.
Once I discover that I am doing something really serious I can go to the Amazon cloud
or get an external GPU (connected via Thunderbolt 3) or build a machine.

Reply

Tim Dettmers says


2016-05-26 at 11:12

Neither cores nor memory is important per se. Cores do not really matter.
Bandwidth is most important and FLOPS second most important. You need a certain
amount of memory to train certain networks; for state-of-the-art models you should
have more than 6GB of memory.
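
As a rough illustration of the bandwidth-first rule, the bandwidth figures quoted elsewhere in this thread can be compared directly. A minimal sketch; this is only a first-order ordering and ignores architecture, FP16 support, and software maturity:

# Rough relative ordering of cards by memory bandwidth (GB/s values quoted in this thread).
# Bandwidth is only a first-order proxy for deep learning speed, not a benchmark.
bandwidth_gb_s = {
    "GTX Titan X (Maxwell)": 336,
    "GTX 1080": 320,
    "Tesla M40": 288,
}

baseline = "Tesla M40"
for card, bw in sorted(bandwidth_gb_s.items(), key=lambda kv: -kv[1]):
    ratio = bw / bandwidth_gb_s[baseline]
    print(f"{card}: {bw} GB/s (~{ratio:.2f}x the bandwidth of the {baseline})")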

Reply

Bob Sperry says


2016-05-10 at 00:57

Hi Tim,

I suppose this is echoing Jeremy’s question, but is there any reason to prefer a Titan
X to a GTX 1080 or 1070? The only spec where the Titan X still seems to perform
better is in memory (12 GB vs. 8 GB).

I got a Titan X on Amazon about 2.5 weeks ago, so have about 10 days to return it
for a full refund and try for a GTX 1080 or 1070. Is there any reason not to do this?
Reply

Tim Dettmers says


2016-05-10 at 12:58

No deep learning performance data is currently available for the
GTX 1000 series, but it is rather safe to say that these cards will yield much better
performance. If you use 16-bit, and probably most libraries will change to that
soon, you will see an increase of at least 2x in performance. I think
returning your Titan X is a good idea.

Reply

Spuddler says
2016-06-11 at 17:49

Just wanted to add that Nvidia artificially crippled the 16bit operation on
the 1070/1080 GTX to abysmal speeds, so we can only hope they don’t do
the same with the Pascal Titan card.

Reply

Jerry says
2016-05-08 at 05:20

Hi Tim. Thanks for an excellent guide! I was wondering what your opinion is on
Nvidia’s new graphics card – Nvidia Geforce GTX 1080. The performance is said to
beat the Titan X and is proposed to be half the price!

Reply


Gilbert says
2016-05-07 at 15:50

Hi, does the number of CUDA cores matter? The GTX 1080 will be released soon and it
has about 2500 CUDA cores, whereas a GTX 980 Ti has about 2800 CUDA cores. Will this
affect the speed of training? Or will the GTX 1080 in general be faster with its 8 teraflops
of performance?

Reply

Tim Dettmers says


2016-05-08 at 15:07

The number of cores does not really matter. It all depends on how these cores are
integrated with the GPU. The GTX 1080 will be much faster than the GTX Titan
X, but it is hard to say by how much.

Reply

Gilbert says
2016-05-09 at 19:07

So you’d recommend that I invest myself in a GTX 1080 instead?

Reply

Daniel Rich says


2016-05-07 at 06:23

So reading in this post that bandwidth is the key limiter makes me think the GTX 1080,
with a bandwidth of 320 GB/s, will be slightly worse for deep learning than a 980 Ti. Does
that sound right?

Reply


Tim Dettmers says


2016-05-08 at 15:01

You cannot compare the bandwidth of a GTX 980 with the bandwidth of a GTX
1080 because the two cards use different chipsets. The GTX 1080 will definitely
be faster.

Reply

DD Sharma says
2016-05-05 at 01:39

Tim,

Any updates to your recommendations based on Skylake processors and especially
Quadro GPUs?

Reply

Tim Dettmers says


2016-05-08 at 15:00

Skylake is not needed and Quadro cards are too expensive, so no changes to
any of my recommendations.

Reply

Lucian says
2016-04-30 at 01:19

Hi Tim, great post!


Could you talk a bit about having different graphics cards in the same computer?
As an extreme example, would having a Titan X, 980 Ti and a 960 be problematic?


Reply

Dorje says
2016-04-24 at 15:29

Thank you very much, Tim.

I got a Titan X, hahaha~

Cheers,
Dorje

Reply

Eduardo says
2016-04-24 at 10:02

Hi, I am a Brazilian student, so everything is way too expensive for me. I will buy a
GTX 960 and start off with a single GPU and expand later on. The problem is that
Intel CPUs with 30+ lanes are WAY too expensive. So I HAVE to go with AMD, but
the motherboards for AMD only have PCIe 2.0.

My question is: can I get good performance out of 2 x 960 GPUs on a PCIe 2.0
x16 mobo? By good I mean equal to a single 960 with x16 on PCIe 3.0, maybe
even a single GTX 980.

Reply

Tim Dettmers says


2016-04-24 at 13:23

Hi, both an Intel CPU with 16 lanes or less (as long as your motherboard
supports 2 GPUs) and an AMD CPU with PCIe 2.0 will be fine. You will not see
large decreases in performance; it should be about 0-10% depending on the task
and the deep learning software.
If you are short on money it might also be an option to use AWS GPU
instances. If you do not train every day this might be cheaper in the end.
However, for tinkering around with deep learning a GTX 960 will be a pretty
solid option.


Reply

Raj says
2016-04-18 at 15:12

Thanks for the great blog, i learned a lot.


For me, getting a 40-lane or even 28-lane CPU and motherboard is out of budget. In my country
these parts are rare.
I am planning to get a 16-lane CPU. With this I can get a motherboard which has 2x PCIe 3.0
x16. I plan to use a single GPU initially. If I want to use 2 GPUs it has to be an x8/x8
configuration. With this configuration, is it practical to use 2 GPUs in the future?
My system will likely have an i7 6700, Asus Z170-A and Titan X.

Cheers,
RK

Reply

Tim Dettmers says


2016-04-18 at 19:55

Hi RK,
16 lanes should still work well with 2 GPUs (but make sure the CPU supports
x8/x8 lanes; I think every CPU does, but I never used them myself). The
transfer to the GPU will be slower, but the computation on the GPU should still be as
fast. You will probably see a performance drop of 0-5% depending on the data that
you have.

Reply

RK says
2016-04-18 at 20:57

Thanks for the fast reply.

Reply

Tim Dettmers says



2016-04-19 at 19:24
You are welcome

Reply

Yi says
2016-04-14 at 07:52

Hi Tim,

Thanks for the great post. Sorry to bother you again. I just want to ask something about
the coolbits option of the GPU cards. Right now I set it to 12 and I can manually control
the fan speed. It works nicely. But I won't check the temperature all the time and
change the fan speed accordingly. So during training, what percentage of fan
speed should I use? 50%, 80%, or an aggressive 90% maybe? Thanks a lot.

And if I keep the fan always running at 80% speed, will it reduce the life span of the
card? Thanks.

Reply

Tim Dettmers says


2016-04-24 at 07:56

The life expectancy of the card increases the cooler you keep it. So if you
can, keep the fan at 100% at all times. However, this can of course cause
problems with noise if the machine is near you or other people. For my
desktop I keep the fan as low as possible while keeping the GPU below 80 degrees C,
and if I leave the room I just set the fan speed to 100%.
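
For those who do not want to watch the temperature manually, the same two tools can be scripted: nvidia-smi to read the temperature and nvidia-settings to set the fan (this needs the coolbits option and a running X server). A rough sketch only; the GPU/fan index 0 and the temperature thresholds below are example assumptions, not recommendations:

import subprocess, time

def gpu_temperature():
    # Read the current GPU temperature in degrees C via nvidia-smi.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"],
        text=True)
    return int(out.split()[0])

def set_fan_speed(percent):
    # Requires coolbits and a running X server; gpu:0 and fan:0 are example indices.
    subprocess.check_call(
        ["nvidia-settings",
         "-a", "[gpu:0]/GPUFanControlState=1",
         "-a", f"[fan:0]/GPUTargetFanSpeed={percent}"])

while True:
    temp = gpu_temperature()
    set_fan_speed(90 if temp >= 80 else 60 if temp >= 70 else 40)
    time.sleep(30)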

Reply

Yi says
2016-04-25 at 06:25

Thanks a lot for your reply, it helps a lot.


Reply

Spuddler says
2016-06-11 at 17:41

Keep in mind that running your fans at 100% constantly will wear out
the fans much faster – although that is better than a dead GPU chip. It
can be difficult to find cheap replacement fans for some GPUs, so you
should look for cheap ones on alibaba etc. and have a few spares lying
around in advance since shipping from china takes weeks.

Also, when a fan stops running smoothly, you can usually just buy
cheap “ball bearing oil” ($4 on ebay or so) and remove the sticker on
the front side of the fan. There will be some tiny holes beneath into
which you can simply squirt some of the oil and most likely the fan will
run as good as new. Worked out for me so far

Reply

Dorje says
2016-04-09 at 16:41

Hi Tim, THANKS for such a great post! And all these responses!

I got a question:
What if I buy a TX1 instead of buying a computer?
I will do video or CNN image classification sorts of things.

Cheers,
Dorje

Reply

Tim Dettmers says


2016-04-24 at 08:03


Hi Dorje,
I also thought about buying a TX1 instead of a new laptop, but then I opted
against it. The overall performance of the TX1 is great for a small, mobile,
embedded device, but not so great compared to desktop GPUs or even laptop
GPUs. There might also be issues if you want to install new hardware because it
might not be supported by the Ubuntu for Tegra OS. I think in the end the
money is better spent on a small, cheap laptop plus some credit for
GPU instances on AWS. Soon there will also be high-performance instances
(featuring the new Pascal P100), so this would also be a good choice for the
future.

Reply

Chip Reuben says


2016-04-08 at 19:16

My guess is that (if done right) the monitor functionality gets relegated to the
integrated graphics capability of the motherboard. Just don’t try to stream high-res.
video while training an algorithm.

Reply

Steven says
2016-04-09 at 03:06

Ooops – I should have mentioned that the motherboard I’m using is an ASRock
Fatal1ty X99 Professional/3.1 EATX LGA2011-3. It doesn’t have an integrated
graphics chip.

Reply

Steven says
2016-04-08 at 05:23

Hi Tim,


This post was amazingly useful for me. I’ve never built a machine before and this
feels very much like jumping in the deep end. There are two things I’m still
wondering about:

1. If I’m using my GPU(s) for deep learning, can I still run my monitor off of them? If
not should I get some (relatively) cheap graphics card to run the monitor, or do
something else?

2. Do you have any opinion about Intel’s i7-4820K CPU vs. the i7-5820K CPU?
There seems to be a speed vs. cache size & cores trade-off here. My impression is
that whatever difference there is will be small, but the larger cache size should lead
to fewer cache misses, which should be better. Is this accurate?

Thanks
Reply

Steven says
2016-04-09 at 15:46

Was just reading through the Q/A’s here and saw your response to Rohit
Mundra (2015-12-22) answered my first question.

Sorry for the repeat….

Reply

Tim Dettmers says


2016-04-24 at 08:04

No problem, I am glad you made the effort to find the answer in the
comment section. Thanks!

Reply

Matt says
2016-04-05 at 04:49


Everyone seems to be using an Intel CPU, but they seem prohibitively expensive if
actual clock speed or cache isn’t that important… Would an AMD cpu with 38 lane
support work just as well paired with two GPUs?

Also, have you experimented with builds using two different GPUs?

Reply

Tim Dettmers says


2016-04-05 at 20:09

Yes, an AMD CPU should work just as well with 2 GPUs as an Intel one. However,
using two different GPUs will not work if they have different chipsets (GTX 980 +
GTX 970 will not work); what will work is two cards of the same chipset from different
vendors (an EVGA GTX 980 + an ASUS GTX 980 will work with no problems).

Reply

Matt says
2016-04-05 at 20:48

I see – thanks! I'm considering just getting a cheaper GPU to at least get my
build started and running and then picking up a Pascal GPU later. My plan
was to use the cheaper GPU to drive a few monitors and use the Pascal
card for deep learning. That kind of setup should be fine, right? In other
words, there is only an issue with two different cards if I try to use them
both in training, but I'm essentially using just a single GPU for it.
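
One way to keep a setup like this clean is to hide the display card from CUDA so that the deep learning framework only ever sees the compute card; the CUDA_VISIBLE_DEVICES environment variable does this for any CUDA application. A minimal sketch; the index 1 here is an assumption, so check nvidia-smi to see which index the compute card actually has:

import os

# Hide all GPUs except device 1 from CUDA before any framework initializes it.
# The index 1 is an example; use `nvidia-smi` to find the index of your compute card.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# ...import and use your deep learning framework here; it will only see that GPU.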

Reply

David Laxer says


2016-04-04 at 10:56

Hi,

Thanks for this post. Are there any Cloud solutions yet?
I used Amazon g2.2xlarge as well as g2.8xlarge as Spot Instances,


however, the GPUs are old, don’t support the latest CUDA features and spot prices
have increased.
Reply

Tim Dettmers says


2016-04-04 at 15:51

There are also some smaller providers for GPUs, but their prices are usually a bit
higher. Newer GPUs will also be available via the Microsoft Azure N-series sometime
soon, and these instances will provide access to high-end GPUs (M60 and K80).
I will look into this issue next week when I update my GPU blog post.

Reply

David Laxer says


2016-04-06 at 18:10

Can you recommend a good box which supports:


1. multiple GPUs for deep learning (say the new Nvidia GP100),
2. additional GPU for VR headset,
3. additional GPU for large monitor?

Thanks!

Reply

Xiao says
2016-04-04 at 03:48

Hi Tim,

Thanks for the post! Very helpful. Was just wondering what editor (monitor in the
center) did you use in the picture showing the three monitors?

Reply


Tim Dettmers says


2016-04-04 at 07:46

That is an AOC E2795VH. Unfortunately they are not sold anymore. But I think
any monitor with a good rating will do.

Reply

Razvan says
2016-03-31 at 18:21

Hey Tim,

Awesome article. Was curious whether you have an opinion on the Tesla M40 as
well.

Looks suspiciously similar to the Titan X.


Think the “best DL acceleration” claim might be a bit of a marketing gamble?

Cheers,
–Razvan

Reply

Tim Dettmers says


2016-04-02 at 09:40

This post is getting slowly outdated and I did not review the M40 yet — I will
update this post next week when Pascal is released.

To answer your question, the Titan X is still a bit faster with 336 GB/s while the
M40 sports 288 GB/s. But the M40 has much more memory which is nice. But
both cards will be quite slow compared to the upcoming Pascal.

Reply

Chip Reuben says


2016-04-04 at 03:26
Wow, I am super glad I read this response. Based on your comment about
the Pascal vs. the Titan X, I was able to place the development of my
system on hold, just in time! I was going to get a Titan X. But now I will
want to know if it will be much better to get the Pascal with 32 GB of
dedicated RAM (VRAM?) vs. the 12 GB of the Titan
X. http://www.pcworld.com/article/2898175/nvidias-next-gen-pascal-gpu-
will-offer-10x-the-performance-of-titan-x-8-way-sli.html

Do you have specific information that suggests it will be one week yet
before the Pascals will be available? How much do you think the 1080 will
be (in USD, Euros, etc.)?

Reply

Chip Reuben says


2016-04-06 at 00:45

The Pascal P100 won’t even be available to most of us until later this year at
the soonest (http://wccftech.com/nvidia-pascal-gpu-gtc-2016/) and it isn’t
even in the same league as the Titan X. They haven’t said anything about
the 10xx’s, so I’m assuming they will be quite a while yet also?

Reply

Chip Reuben says


2016-03-27 at 21:48

Thanks for the great answers. Do you think that one Titan X with 12 GB of memory is
better than, say, two GTX 980s, or two of the upcoming Pascals (xx80s)? I currently
have a system designed with a motherboard that has the additional PCIe
lanes, but (as I've been told by the Puget Systems people) adding a second
GPU would slow things down by 2x. So I thought "just get the Titan with 12 GB of
memory and be done with it." Do you think that sounds ok? Or do I upgrade the
motherboard? I'm thinking that the Titan may be more than I ever need, but
unfortunately I do not know. Thank you for your great help and thorough work.

Reply


Yi Zhu says
2016-03-26 at 08:34

Hi Tim,

Thanks for the great post. I am a graduate student and would like to put together
a machine soon. If I put up a system with an i7-5930K CPU, Asus X99 Deluxe
MOBO and two Titan X GPUs for now, will Pascal GPUs be compatible with this
configuration? Can I just simply plug in a Pascal GPU when it is released? Thanks a
lot.

Reply

Tim Dettmers says


2016-03-27 at 15:30

As far as I understand there will be two different versions of the NVLink


interface, one for regular cards and one for workstations. I think you should be
alright with your hardware, although you might want to wait for a bit since
Pascal will be announced soon and probably ship in May/June.

Reply

Hehe says
2016-03-20 at 12:44

Why is the AWS g2.8x not enough?

It says 60 GiB (approx. 64 GB) of GPU memory.
Thanks

Reply

Tim Dettmers says


2016-03-20 at 22:00

The 60GB refers to the CPU memory that the AWS g2.8x has. The GPU
memory is 4GB per card.


Reply

Chip says
2016-03-20 at 06:41

“CPU and PCI-Express. It’s a trap!”


I have no idea what that is supposed to mean. Does that mean I avoid PCI express?
Or just certain Haswells? What is the point here?

Reply

Tim Dettmers says


2016-03-20 at 21:59

Certain Haswells do not support the full 40 PCIe lanes. So if you buy a Haswell,
make sure it supports them if you want to run multiple GPUs.

Reply

Phong says
2016-03-17 at 23:55

You say GTX 680 is appropriate for convnets, however I see GTX 680 just has 2GB
RAM which is inadequate for most convnets such as AlexNet and of course VGG
variants.

Reply

Tim Dettmers says


2016-03-18 at 13:15

There is also a 4GB GTX 680 variant which is quite okay. Of course a GTX 980
with 6GB would be better, but it is also way more expensive. However, I would
recommend one GTX 980 over multiple GTX 680. It is just not worth the trouble
to parallelize on these rather slow cards.

Reply

Chip says
2016-03-16 at 11:05

Hi Tim,
Thanks for this excellent primer. I am trying to get a part set and have this so far
(http://pcpartpicker.com/p/JnC8WZ) but it has some 2 incompatibility issues.
Basically, I want to be working through this 2nd Data Science Bowl
(https://www.kaggle.com/c/second-annual-data-science-bowl) as an exercise. I will
likely work with a lot of medical image data. Also, I will use this system as an all-
purpose computer too (for medical writing), so I’m wondering if I also need to add
the USB, HDMI, and DVI connects (I currently also use an Eizo ColorEdge CG222W
monitor). Also, I like the idea of 2 hard drives, one for Windows and one for
Linux/Ubuntu (or I could partition?) Finally, I use a wireless connect, hence that
choice. I would be most grateful if you could help with the 2 incompatibilities, any
omissions, and seeing if this system would generally be ok. Thank you in advance
for your time.

Reply

Tim Dettmers says


2016-03-18 at 13:22

You can resolve the compatibility issue by choosing a larger mainboard. A


larger mainboard should give you better RAM voltage and also fixes the PCIe
issue. Although the GTX 680 might be a bit limiting for training state of the art
models, it is still a good choice to learn on the Data Science Bowl dataset. Once
Pascal hits the market you can easily upgrade and will be able to train all state-
of-the-art networks easily and quickly.

Reply

Chip says
2016-03-20 at 05:15

Thank you for this response. I had the GTX 980 selected (in the
pcpartpicker permalink), but I may well just wait for the Pascal that you
suggested. I read this article (http://techfrag.com/2016/03/18/nvidia-
pascal-geforce-x80-x80ti-gp104-gpu-supports-only-gddr5-memory/),


however, and suppose I must admit I’m quite confused with the names, the
relationship of “Pascal” to GeForce X80, X80Ti & Titan Specs, and also the
concern with respect to GDDR5 vs. GDDR5X memory. Is it worth it to wait
for one of the GeForce (which I assume is the same as Pascal?) rather than
just moving forward with the GTX 980? Will one save money by way of
sacrificing something with respect to memory? Please forgive my neophyte
nature with respect to systems.
Reply

Tim Dettmers says


2016-03-20 at 22:02

Pascal will be the new chip from NVIDIA which will be released in a few
months. It should be designated as GTX 10xx. The xx80 refers to the
most powerful consumer GPU model of a given series, e.g. the GTX
980 is the most powerful card of the 900 series. The GTX Titan is usually the
model for professionals (deep learning, computer graphics for industry
and so forth).

And yes, I would wait for Pascal rather than buy a GTX 980. You could
buy a cheap small card and sell it once Pascal hits the market.

Reply

Wajahat says
2016-03-07 at 13:57

Hi Tim
Thanks a lot for your article. It answered some of my questions. I am actually new
to deep learning and know almost nothing about GPUs. But I have realized that I need
one. Can you comment on the expected speedup if I run ConvNets on a Titan X
rather than on an Intel Core i7-4770 at 3.4 GHz?
Even a vague figure would do the job.

Best Regards
Wajahat

Reply


Tim Dettmers says


2016-03-07 at 14:04

It depends highly on the kind of convnet you want to train, but a speedup
of 5-15x is reasonable. However, if you can wait a bit, I recommend waiting
for the Pascal cards, which should hit the market in two months or so.

Reply

viper65 says
2016-02-23 at 18:15

Thank you. But considering the size of the memory and the brand, I am afraid the price
of Pascal will be far beyond my budget.

Reply

viper65 says
2016-02-22 at 22:20

Nice article!
What do you think about HBM? Apart from the size of the RAM, do you think the Fury X
has any advantage compared to the 980 Ti?

Reply

Tim Dettmers says


2016-02-23 at 13:05

The Fury X definitely has the edge over the GTX 980 Ti in terms of hardware,
though in terms of software AMD still lags behind. This will change quite
dramatically once NVIDIA Pascal hits the market in a few months. HBM is
definitely the way to go to get better performance. However, NVIDIA's HBM
offers double the memory bandwidth of the Fury X, and Pascal will
also allow for 16-bit computations, which effectively doubles the performance
further. So I would not recommend getting a Fury X, but instead to wait for
Pascal.
Reply

Bobby says
2016-02-23 at 21:37

How soon do you think will flagship of Pascal, like Titan X, be on the
market? I am not sure if I should wait. Thank you.

Reply

hroent says
2016-08-12 at 02:34

Hi Tim — Thanks for this article, I’ve found it extremely useful (as have
others, clearly).

You’re probably aware of this, but the new Titan X Pascal cards have very
weak FP16 performance.

Reply

Tim Dettmers says


2016-08-13 at 21:56

Yes the FP16 performance is disappointing. I was hoping for more, but I
guess we have to wait until Volta is released next year.

Reply

Freddy says
2016-02-08 at 14:15

Hey Tim,
first of all thank you very much for your great article. It helped me a lot to gain
some insight into the hardware requirements for a DL machine. Over the past
several years I only worked with laptops (in my free time) as I had some good
machines at work. Now I am planning to set up a system at home to start
experimenting in my free time. After I read your post and many of
the comments I started to create a build (http://de.pcpartpicker.com/p/gdNRQ7),
and as you have looked over so many systems and given advice, I hoped that you could
maybe do it once again.
I chose the 970 as a starter and will then wait for the Pascal cards coming out later
this year. I am also not planning to work with more than 2 GPUs at home in the
future. And as for the monitor: I already have one 24″ at home, so this will just be the
2nd.
I dunno, maybe you can look over it and give me some advice or your opinion.

Reply

Tim Dettmers says


2016-02-09 at 14:20

Looks like a solid build for a GTX 970 and also after an upgrade to one or two
Pascals this is looking very good.

Reply

Freddy says
2016-02-09 at 15:48

Thanks for the time you are spending giving so many people advice. It
is/was quite hard for me after so many years of laptop use to dive back
into hardware specifics. You made it a lot easier with your post. Big thanks
again!

Reply

Lawrence says
2016-02-06 at 22:36


Hi Tim,
Great website! I am building a Devbox, https://developer.nvidia.com/devbox.
My machine has 4 Titan X cards, a Kingston Digital HyperX Predator 480 GB PCIe
Gen2 x4, an Intel Core i7-5930K Haswell-E, and 64GB of G.SKILL RAM. I am using an ASUS
RAMPAGE V Extreme motherboard. When I place the last Titan X card in the last
slot, my SSD disappears from the BIOS. I am not sure if I have a PCIe conflict. Can the
M.2 interfere with PCIE_X8_4? What should I do to fix this issue? Should I
change the motherboard? Any advice?

Reply

Tim Dettmers says


2016-02-07 at 11:46

Your motherboard only supports 40 PCIe lanes, which is standard, because


CPUs only support a maximum of 40 PCIe lanes. Your 4 Titan X will run in
16x/8x/8x/8x lane mode. You might be able to switch the first GPU to 8x
manually, but even then CPUs and motherboards usually do not support a
8x/8x/8x/8x/8x mode (usually two PCIe switches are supported for a single
GPU, and a single PCIe switch supports two devices, so you can only run 4 PCIe
devices in total). This means that there is probably no possibility to get your
PCIe SSD working with 4 GPUs. I might be wrong. To check this it is best to
contact your ASUS tech support and ask them if the configuration is possible or
not.

Reply

Bobby says
2016-02-19 at 07:07

Hi Tim,

Thank you for the wonderful guide.

Like Lawrence, I'm also building a GPU workstation
using https://developer.nvidia.com/devbox as the guide. It mentions a
"512GB PCI-E M.2 SSD cache for RAID". I wonder how to set up this SSD as
the cache for RAID, since RAID 5 does not support this as far as I know. Have you
done anything similar? Thank you very much.

Reply

Tim Dettmers says



2016-02-19 at 16:00
Hi Bobby,

I have no experience with RAID 5, since usual datasets will not benefit
from increased read speeds as long as you have an SSD. I think you will
need to change some things in the BIOS and then set up a few things for
your operating system with a RAID manager. You should be able to
find a tutorial for your OS online so you can get it running.

Reply

Bobby says
2016-02-21 at 01:12

Hi Tim,

It seems it's not related to the RAID. I wonder how to set up an SSD
as the cache for a normal HDD. Setting it as the cache for RAID
should be similar. With this, I may not need to manually copy my
dataset from HDD to SSD before an experiment. Thank you.

Alex Blake says


2016-01-19 at 07:19

Hi Tim:
Thanks so much for sharing your knowledge!
I've seen you mention that Ubuntu is a good OS.
what is the best OS for deep learning?
What is a good alternative to Ubuntu?
I’d really appreciate your thoughts on this…

Reply

Tim Dettmers says


2016-01-25 at 14:08
Linux-based systems are currently best for deep learning since all major deep
learning software frameworks support Linux. Another advantage is that you will
be able to compile almost anything without problems, while on other
systems (Mac OS, Windows) there will always be some problems, or it may be
nearly impossible to configure the system well.

Ubuntu is good because it is widely used, easy to install and configure, and it
has support for its LTS versions, which makes it attractive for software
developers who target Linux systems. If you do not like Ubuntu you can use
Kubuntu or other X-buntu variants; if you like a clean slate and want to configure
everything the way you like, I recommend Arch Linux, but beware that it will
take a while until you have configured everything in a way that suits you.

Reply

JB says
2016-01-10 at 22:01

Tim,

First of all, thank you for writing this! This post has been extremely helpful to me.

I’m thinking about getting a gtx 970 now and upgrading to pascal when it comes
out. So, if I never use more than 3.5gb vram at a time, then I won’t see
performance hits, correct? I’m building my rig for deep reinforcement learning
(mostly atari right now), so my minibatches are small (<2MB), and so are my
convnets (<2mill weights). Should I be fine until pascal?

I'm trying to decide between these two budget builds: [Intel Xeon e5]
(http://pcpartpicker.com/p/dXbXjX) and [Intel i5]
(http://pcpartpicker.com/p/ktnHdC). I'm thinking about going with the Xeon, since
it has all 40 pcie lanes if I wanted to do more than two gpus in the future, and it's a
beefier processor. However, I start grad school in the fall, so I'd have university
hardware then, and think I'd be more than fine with two gpus for personal
experiments in the future. (Or could 4 lanes be enough bandwidth for a gpu?) If I
get the i5 I could upgrade the processor without having to upgrade the
motherboard if I wanted. The processor just needs to be good enough to run
(atari) emulations and preprocess images right now. I can't really imagine anything
but the GPU being the bottleneck, right?

Thank you for the help. I'm trying to figure out something that will last me awhile,
and I'm not very familiar with hardware yet.


Thanks again,
– JB
Reply

Tim Dettmers says


2016-01-25 at 14:04

Hi JB,

the GTX 970 will perform normally if you stay below 3.5GB of memory. Since
your mini-batches are small and you seem to have rather few weights, this
should fit quite well into that memory. So in your case the GTX 970 should give
you optimal cost/performance.

Reply

Rohit Mundra says


2015-12-22 at 02:48

Hey Tim,

Thanks for the great article; I have a more specific question though – I’m building
an entry-level Kaggle-worthy system using an i7-5820K processor. Since I want to
keep my GTX 960’s 4GB memory solely for deep learning, would you recommend I
buy an additional (cheaper) graphic card for display or not? I’m considering the GT
610 for this purpose since it’s cheap enough. Also, if I were to do this, where would I
specify such a setting (e.g. use GT 610 for display)?

Thanks again!
Rohit

Reply

Tim Dettmers says


2015-12-22 at 14:06

For most datasets on Kaggle your GPU memory should be okay, and using
another small GPU for your monitors will not do much. However, if you are
doing one of the deep learning competitions and you find yourself short on
memory and you think you could improve your score by using a model that is
a bit larger, then this might be worth it. So I would only consider this option if
you really encounter problems where you are short on memory.

Also remember that the memory requirements of convolutional nets increase
most quickly with the batch size, so going from a batch size of 128 to 96 or
something similar might also solve memory problems (although this might also
decrease your accuracy a bit; it is all quite dependent on the data set and
problem). Another option would be to use the Nervana Systems deep learning
libraries, which can run models in 16-bit, thus halving the memory footprint.
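
To get a feel for how much a smaller batch size or 16-bit storage buys, a back-of-the-envelope estimate of activation memory is enough. A minimal sketch under simplified assumptions: activations dominate and scale roughly linearly with the batch size, and the layer sizes below are made up for illustration, not a real architecture:

# Back-of-the-envelope activation memory for a small convnet at different batch sizes.
feature_maps = [            # (channels, height, width) per layer; illustration values only
    (64, 224, 224),
    (128, 112, 112),
    (256, 56, 56),
    (512, 28, 28),
]

def activation_memory_gb(batch_size, bytes_per_value):
    values = sum(c * h * w for c, h, w in feature_maps) * batch_size
    return values * bytes_per_value / 1024**3

for batch in (128, 96):
    for bytes_per_value, label in ((4, "32-bit"), (2, "16-bit")):
        print(f"batch {batch:3d}, {label}: ~{activation_memory_gb(batch, bytes_per_value):.2f} GB")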
Reply

Fusiller says
2015-11-29 at 18:16

Just a quick note to say thank you and congrats for this great article.
Very nice of you to share your experience on the matter.
Regards.
Alex

Reply

Tim Dettmers says


2015-11-30 at 09:41

Thank you! I am happy that you found the article helpful!

Reply

Eystein says
2015-11-19 at 19:26

Hello! First off, I just want to say this website is a great initiative!

I'm going to use Kaldi for speech recognition next spring in my master's thesis.
Not knowing exactly what type of DNNs I'll be implementing, I'm planning for an
all-round solid, budget GPU. Is the GTX 950 with 2 GB suitable (I haven't seen this
mentioned here)? It only requires a 350 W PSU, which is why I'm considering it. Also,
I have a Q6600 CPU and a motherboard that has 4 GB RAM as a max, so this is a
bit constraining on the overall performance of this setup. And apologies if this is
too general a question. I'm just now getting into the field.
Reply

Tim Dettmers says


2015-11-21 at 11:18

The GTX 950 2GB variant might be a bit short on RAM for speech recognition if
you use more powerful models like LSTMs. The cheapest solution might be to
prototype on your CPU and use AWS GPU instances to run the model if
everything looks good. This way you need no new computer/PSU and will be
able to run large LSTMs and other models. If this does not suit you, a GTX 950
with 4GB of memory might be a good choice.

Reply

Eric says
2015-10-26 at 08:29

Tim,

Thank you for the many detailed posts. I am going with a one GPU Titan X water
cooled solution based on information here. Does it still hold true that adding a
second GPU will allow me to run a second algorithm but that it will not increase
performance if only one algorithm is running? Best Regards – Eric

Reply

Tim Dettmers says


2015-10-26 at 22:16

There are now many good libraries which provide good speedups for multiple
GPUs. Torch7 is probably the best of them. Look for the Torch7 Facebook
extensions and you should be set.

Reply


BK says
2015-10-21 at 16:50

Hi Tim,

Great post; In general all of the content on your blog has been fantastic.
I’m a little curious about your thoughts on other types of hardware for use in deep
learning. I’ve heard a number of people suggest FPGAs to be potentially useful for
deep learning(and parallel processing in general) due to their memory efficiency vs.
GPUs. This is often mentioned in the context of Xeon Phi… what are your thoughts
on this? If true, where does the usefulness lie, in the 'training' or 'scoring' part of
deep learning (my perhaps incorrect understanding was that the GPU's advantage was its
use for training as opposed to scoring)?

My apologies for what I’m certain are sophomoric questions; I’m trying to wrap my
head around these matters as someone new to the subject!

Regards,

BK

Reply

Tim Dettmers says


2015-10-26 at 22:02

Nonsense, these are great questions! Keep them coming!

FPGAs could be quite useful for embedded devices, but I do not believe they
will replace GPUs. This is because (1) their individual performance is still worse
than an individual GPU and (2) combining them into sets of multiple FPGAs
yields poor performance while GPUs provide very efficient interfaces (especially
with NVLink which will be available at the end of 2016). GPUs will make a very
big jump in 2016 (3D memory) and I do not think FPGAs will ever catch up
from there.

Xeon Phi is potentially more powerful than GPUs, because it is easier to
optimize them at the low level. However, they lack the software for efficient
deep learning (just like AMD cards) and as such it is unlikely that we will see
Xeon Phis being used for deep learning in the future (unless Intel creates a huge
deep learning initiative that rivals that of NVIDIA).

Reply


BK says
2015-10-28 at 15:29

Thanks for the response! That’s very interesting.

I wanted to follow up a little bit regarding software development for


NVIDIA vs. Intel or AMD. I know how much more developed CUDA libraries
are when it comes to Deep learning than OpenCL. What frameworks can I
actually run with an intel or AMD architecture? Do torch/caffe/Theano only
work on NVIDIA hardware? Once again, my apologies if I’m fundamentally
misunderstanding something.

One last question: beyond the world of deep learning, what is the
perception of Xeon Phi? It seems hard to find people who are talking with
certainty as to what its strengths/applications will be. Is there any consensus
on this? What do you think makes the most sense as an application for Xeon
Phi?

Many thanks!

-BK

Reply

Greg says
2015-10-20 at 00:45

Hey Tim…

Do you have any suggestions for a tutorial for DL using Torch7 and Theano and/or
Keras?

Thanks

Greg

Reply


Nghia Tran says


2015-10-10 at 07:59

Hi Tim,
Thank you very much for all the writing. I am an Objective-C developer but a brand-new
newbie to deep learning and very interested in this area right now.
I have a Mac 3.1 and I would like to upgrade the graphics card to get CUDA so I can
run Torch7, Lua and nn and learn this kind of programming. It does not matter whether
it is a Mac card or a Windows card.
Which one would you recommend? GTX 780 Ti? GTX 960 2GB? GTX 980? Tesla
M2090 (second hand)?
Looking forward to your advice.

Reply

Tim Dettmers says


2015-10-10 at 10:40

From the cards you listed, the GTX 980 will be the best by far. Please also have a
look at my GPU guide for more info on how to choose your GPU.

Reply

Nghia Tran says


2015-10-29 at 16:47

Thank you very much. I got a generous sponsor to build a new Ubuntu
machine with 2 GTX 780 Ti. Should I use the GTX 980 in the new machine
to yield better performance than the SLI GTX 780 Ti, or let it stay in my Mac?

Reply

Tim Dettmers says


2015-10-30 at 09:31

If you already have the two GTX 780 Ti I would stick with that and only
change/add the GPU if you experience RAM shortage for one of your
models.

Reply


Nghia Tran says


2015-10-30 at 12:02

Thank you very much Tim. I am looking forward to your further
writing.
By the way, have you had time to look at the neuro-synaptic chip
from IBM yet? I am really interested in your "deep analysis" of this as
well.

Brent Soto says


2015-10-08 at 16:30

Hi Tim, The company that I buy my servers from (Thinkmate) recently sent me an
e-mail advertising that they’ve been working with Supermicro to sell servers with
support for Titan X. What do you think about this solution? I’ve had a lot of luck
with Supermicro servers, and they offer 3 year warranty on the Titans and will
match the price if found cheaper elsewhere. Here’s the
link: http://www.thinkmate.com/systems/servers/gpx/gtx-titan-x

Reply

Tim Dettmers says


2015-10-08 at 18:42

Hi Brent, I think in terms of the price you could definitely do better than the 1U
model with 4 GTX Titan X. A normal board with 1 CPU will not have any
disadvantage compared to the 1U model for deep learning.

However, the 4U model is different because it can use 8 GTX Titan X with a fast
CPU-to-CPU switch, which makes parallelization over 8 GPUs easy and fast. There
are only a few solutions available that are built like this and come with 8 GTX
Titan X, so while the price is high, this will be a rather unique and good
solution.


Reply

Greg says
2015-10-08 at 05:13

Yes, I did the BIOS flash in the beginning.

Lastly, I kept testing and found the culprit… when installing CUDA I can't install the
502 driver that it comes with or the Ubuntu system locks up with an unknown
password, no matter how many different ways I try to install the CUDA driver. I
scoured the internet for a solution and there wasn't one, and it looks like no one has
put 2 and 2 together about the CUDA driver. It could be a combo of things, both
hardware and software, but it definitely involves this driver, the X99 MB, a Titan X and
Ubuntu 14.04 and 15.04.

Thanks.

Reply

Greg says
2015-10-06 at 21:08

Hi Tim..

Recently I have had a ton of trouble working with Ubuntu 14.04: installing CUDA,
Caffe, etc. Ubuntu has password-locked me out of my system twice, and getting all
the dependencies installed to get Caffe to build has been a real problem. It works
sometimes; other times it doesn't. Ubuntu 14.04 is clearly an unstable OS.

I would like your opinion, Tim, on moving from Linux to Windows for deep learning.
What are your thoughts?

Thanks in advance…

-Greg

Reply


Tim Dettmers says


2015-10-07 at 10:34

I can feel your pain, I have been there too! Ubuntu 14.04 is certainly not
intuitive when you are switching from Windows, and a simple, seemingly harmless
command can ruin your installation. However, I found that once you understand
how everything is connected in Linux, things get easier, make sense, and you no
longer run into errors which break your installation or even your OS. After
that point, programming in Linux will be much more comfortable than in
Windows due to the ease of compiling and installing any library. So it may be painful,
but it is well worth it. You will gain a lot if you go through the pain-mile;
keep it up!

Reply

Greg says
2015-10-07 at 18:43

After Ubuntu 14.04 locked me out 3 times via a boot-up and false
logon screen, I thought I'd try Ubuntu 15.04. I think the CUDA driver
slammed Unity, resetting the root password to something other than the
password I gave it. I searched the web and this is a common problem and
there seems to be no fix.

I'm running an X99 MB, i7 5930, 64 GB RAM, and one Titan X. I'll get a second
Titan X when I'm ready for it. I want to create my own NN and nodes, but
for now I have a ton of learning to do and I need to follow what's been
done so far.

Do you use standard libraries and algorithms like Caffe, Torch7 and
Theano via Python? I feel I need to wade through everything to see how it
works before using it. Nvidia DIGITS looks pretty simple working from the
GUI, but it also looks, from my limited experience, like it's pretty limited.

Reply

Tim Dettmers says


2015-10-07 at 18:49

Is this because of your X99 board? I never had any problems like that.
As for the software, Torch7 and Theano (Keras and derivatives) work
just fine for me. I have tried Caffe once and it worked, but I have also heard
some nightmare stories about installing Caffe correctly. NVIDIA DIGITS
will be just as you described: simple and fast, but if you want to do
something more complex it will just be an expensive, fast PC with 4
GTX Titan X.
Reply

mxia.mit@gmail.com says
2015-10-07 at 20:36

Just to tag onto this, I have an X-99 E board, and had some problems on the
initial install when trying to boot into ubuntu’s live installer, nothing with the
password though. After installing everything worked fine at the OS level. In
case this is relevant, reflashing to the latest BIOS helped a lot, but probably
won’t help your password problem.

Cheers and best of luck!

Mike

Reply

Safi says
2015-10-04 at 22:54

Hi Tim,
First, thanks a lot for these interesting and useful topics. I am a PhD student; I work
on evolutionary ANNs.
I want to start using GPUs; my budget can reach $150 max.
I found in my town a new GTX 750 and a GTX 650 Ti. Which one is better, and are
they supported by cuDNN?
Thanks

Reply

Tim Dettmers says


2015-10-05 at 07:45


A GTX 750 should be better, and both support cuDNN. However, I would also
suggest that you have a look at AWS GPU instances. The instance will be a bit
faster and may suit your budget well.

Reply

ML says
2015-09-28 at 17:02

Hello Tim, what about external graphic cards connected through Thunderbolt?
Have you looked at those? Could that be a cheap solution without having to
build/buy a new system?

Reply

Tim Dettmers says


2015-09-28 at 17:33

I looked at some performance reviews and they state about 70-90%
performance for gaming. For deep learning the only performance bottleneck
will be transfers from host to GPU, and from what I read the bandwidth is good
(20GB/s) but there is a latency problem. However, that latency problem should
not be too significant for deep learning (unless it's a HUGE increase in latency,
which is unlikely). So if I put these pieces of information together, it looks as if
an external graphics card via Thunderbolt should be a good option if you have
an Apple computer and have the money to spare for a suitable external
adapter.

Reply

Tony says
2015-09-25 at 18:29

Tim, thanks again for such a great article.

One concern that I have is that I also use triple monitors for my work setup.
However, doesn't the fact that you're using triple monitors affect the performance of
your GPU? Do you recommend buying a cheap $50 GPU for the triple monitor
setup and then dedicating your Titan X or your more expensive card primarily to deep
learning? I run recurrent neural nets.

Thanks!
Reply

Tim Dettmers says


2015-09-25 at 19:30

Three monitors will use up some additional memory (300-600MB) but should
not affect your performance greatly (< 2% performance loss). I recommend
getting a cheap GPU for your monitors only if you are short on memory.

Reply

Tony says
2015-09-28 at 19:00

Thanks, that makes a lot of sense. I just thought it would affect your
bandwidth (as that is usually the bottleneck). I'm currently running the 980
Ti; I know it has 336 GB/s. Good to know that it uses some memory
though. Appreciate it.

Reply

Michael Holm says


2015-09-23 at 21:22

Hello Tim,

Thank you for your article. The deep learning devbox (NVIDIA) has been touted as
cutting edge for researchers in this area. Given your dual experience in both the
hardware and algorithm sides, I would be grateful to hear your general thoughts on
the devbox. I know it came out a few months after you wrote your article.

Thank you!


Reply

Colin McGrath says


2015-09-08 at 18:39

I just want to thank you again Tim for the wonderful guide. I do have a couple of
hardware utilization questions though. I am trying to figure out how to properly
partition my space in ubuntu to handle my requirements. I dual boot Windows 10
(for work/school) and Ubuntu 14.04.3 (deep learning) with each having their own
SSD boot drive and HDD storage drive. For starters here’s my setup:

– ASRock X99 WS-E


– 1x Gigabyte G1 980 ti
– 16GB Corsair Vengeance RAM 2133
– i7-5930k
– 2x Samsung 850 Pro 256GB SSDs (boot drives)
– 2x Seagate Barracuda 3TB HDDs (storage drives)

My windows install is fine, but I want to be able to store currently unused data in
the HDD, stage batches in the SSD then send the batches from SSD to RAM to fully
leverage the IOPS gain in a SSD.
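
A staging step like the one described here can be as simple as copying the active dataset from the HDD archive to an SSD scratch directory before training. A minimal sketch with made-up example paths (adjust them to wherever your HDD and SSD are actually mounted):

import shutil
from pathlib import Path

# Example paths only: /store is the HDD archive, the scratch directory lives on the SSD.
hdd_archive = Path("/store/datasets/my_dataset")
ssd_scratch = Path("/home/username/scratch/my_dataset")

if not ssd_scratch.exists():
    ssd_scratch.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(hdd_archive, ssd_scratch)

# ...point the training script's data directory at ssd_scratch...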

I currently have Ubuntu partitioned this way, however I’m not entirely sure this will
fit my needs. I’m thinking I might want to allocate /home on the HDD due to how
ubuntu handles the /home directory in the UI, but I’m unsure if that will be a
problem with deep learning:

SSD (boot):
– swap area – 16GB
– / – 20GB
– /home – 20GB
– /var – 10GB
– /boot – 512MB
– /tmp – 10GB
– /var/log – 10GB

HDD
– /store 1TB

Reply

vinay says


2015-09-08 at 15:25
Does anyone know what would be the requirements for prediction clusters? Most
articles focus on training aspects but inference/prediction is also important and
compute demand for these are little discussed. Can anyone comment on compute
demands for prediction? Also, what do you recommend, CPU only, CPU+GPU, or
CPU+FPGA, etc for such tasks?

Thanks,
Vinay

Reply

Tim Dettmers says


2015-09-08 at 17:15

Which solution is suitable depends on many factors. If you build a web
application, how long do you want your users to wait for a prediction (response
time)? How many predictions are requested per second in total (throughput)?

Prediction is much faster than training, but still, a forward pass over about 100
large images (or similarly large input data) takes about 100 milliseconds on a
GPU. A CPU could do that in a second or two.

If you predict one data point at a time, a CPU will probably be faster than a
GPU (convolution implementations relying on matrix multiplication are slow if
the batch sizes are too small), so GPU processing is good if you need high
throughput in busy environments, and a CPU for single predictions (1 image
should take only about 100 milliseconds with a good CPU implementation). Multiple
CPU servers might also be an option, and usually they are easier to maintain
and cheaper (AWS spot instances for example, also useful for GPU work). Keep
in mind that all these numbers are reasonable estimates only and will
differ from the real results; results from a testing environment that simulates the
real environment will make it clear whether CPU servers or GPU servers are
optimal.

I do not recommend FPGAs for such tasks since interfaces to FPGAs are not easy
to maintain over time and cloud solutions do not exist (as far as I know).
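
The ballpark numbers above can be turned into a quick capacity estimate when weighing CPU against GPU serving. A minimal sketch using only the rough figures from this reply (a batched GPU forward pass of ~100 images in ~100 ms, ~100 ms per single image on a good CPU implementation); measurements from your own test environment should replace these:

# Rough serving-capacity estimate from the ballpark numbers in the reply above.
gpu_batch_size = 100
gpu_batch_latency_s = 0.1      # latency per batched forward pass on the GPU
cpu_image_latency_s = 0.1      # latency per single-image prediction on the CPU

gpu_throughput = gpu_batch_size / gpu_batch_latency_s   # images/s when requests can be batched
cpu_throughput = 1 / cpu_image_latency_s                # images/s per CPU worker

print(f"GPU (batched): ~{gpu_throughput:.0f} predictions/s, "
      f"but a request may wait up to {gpu_batch_latency_s * 1000:.0f} ms for its batch")
print(f"CPU (single requests): ~{cpu_throughput:.0f} predictions/s per worker, "
      f"{cpu_image_latency_s * 1000:.0f} ms response time")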

Reply

Sascha says
2015-09-05 at 16:44

Hi,

thanks a lot for all this information. After stumbling across a paper from Andrew Ng
et al (“Deep learning with COTS HPC systems”) my original plan was to also build a
cluster (to learn how it is done). I wanted to go for two machines with a bunch of
GTX Titans but after reading your blog I settled with only one pc with two GTX 980s
for the time being. My first thought after reading your blog was to actually settle for
two 960s but then I thought about the energy consumption you mentioned.
Looking at the specifications of the nvidia cards I figured the 980 were the most
efficient choice currently (at least as long as you have to pay German energy
prices).
As I am still relatively fresh to machine learning I guess this setup will keep me busy
enough for the next couple of months, probably until the pascal architecture you
mentioned is available (I read somewhere 2nd half of 2016). If not then I guess I will
buy another PC and move one of the 980s into it so that I can learn how to setup a
cluster (my current goal is learning as much as fast as possible).

The configuration I went for is as follows:

CPU: Intel i7-5930k (I chose this one instead of the much cheaper 5920 as it has the
40 PCI lanes you mentioned, which gives the additional flexibility of handling 4
graphics cards)
Mainboard: ASRock Fatal1ty X99 Professional (supports up to 4 graphics cards and
has a M.2 slot)
RAM: 4×8 GB DDR4-3000
Graphics Card: 2x Zotac GTX 980 AMP! Edition
Hard Disk: Samsung SSD SM951 with 256 GB (thanks to M.2 it offers 2 GB/s of
sequential read performance)
Power Supply: be quiet! BN205 with 1200 Watts

I hope that installing Linux on the SSD works, as I read that the previous version of
this SSD caused some problems.

Thanks again
Sascha

Reply

Tim Dettmers says


2015-09-06 at 06:35

Hi Sascha! Your reasoning is solid and it seems you have a good plan for the
future. Your build is good, but as you say, the PCIe SSD could be a bit
problematic to set up. Another fact to be aware of is that your GPUs will have a
slower connection with that SSD in place, because the SSD takes away bandwidth from
your GPUs (your GPUs will run at 16x/8x instead of 16x/16x). Overall the PCIe
SSD would be much faster for common applications, but slower when you use
parallelism on two GPUs; in that case it might be better to go for a SATA SSD (if you do not
use parallelism that much, a PCIe SSD is a solid choice). A SATA SSD will be
slower than the PCIe one, but it should still be fast enough for any deep
learning task. However, preprocessing will be slower on the SATA SSD, and this is
probably the main advantage of the PCIe SSD.
Reply

Sascha says
2015-09-06 at 09:49

That is an interesting point you make regarding the M.2. I did not realise
that this is how the board will distribute the lanes. I figured that as the M.2
only uses 4 lanes the two cards could each run with 16 and if I actually
decided to scale up to a quad setup each card eventually would only get 8
lanes.

My first idea after reading the comment was to just try the SSD in the
additional M.2 PCIe 2.0 slot, which is basically a SATA 6 Gb/s connection, but that
will not work, as it will not fit: one has the Key B and the other the
Key M layout.

Do you have an idea about what this actually means for real life
performance in deep learning tasks (like x% slower)?

Greetings
Sascha

Reply

Tim Dettmers says


2015-09-06 at 10:23

When I think about it again, I might be wrong about what I just said.
How two GPUs and the PCIe SSD will work depends highly on your
motherboard and how the PCIe slots are wired and how the PCIe-
switches are distributed. I think with a 40 PCIe lane CPU and a
mainboard that supports 16x/16x/8x layout, it should be possible to
configure that to use 16 lanes for your GPUs and 8 lanes for your SSD;
to use that setup you only need to make sure to plug everything into
the right slot (your mainboard manual should state how to do this). I
have not looked at your hardware in detail, but I think your hardware
supports this.

If your motherboard does not support 16x/16x/8x, then your GPU
parallelism will suffer from that. Convolutional nets will have a penalty
of 5-15% depending on the architecture; recurrent networks may have
a low or no penalty (LSTMs) or a high penalty (20-50%) for recurrent
nets with many parameters like vanilla RNNs.
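
(If you want to verify how many lanes each GPU actually negotiated, a minimal sketch using the pynvml bindings, assuming the nvidia-ml-py package and an NVIDIA driver are installed, reads the current and maximum PCIe link width and generation:)

```python
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):          # older pynvml versions return bytes
            name = name.decode()
        gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
        width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
        max_width = pynvml.nvmlDeviceGetMaxPcieLinkWidth(handle)
        print(f"GPU {i} ({name}): PCIe gen {gen}, x{width} (max x{max_width})")
finally:
    pynvml.nvmlShutdown()
```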
Reply

Colin McGrath says


2015-09-02 at 05:23

What are your opinions on RAID setups in a deep learning rig? Software-based
RAID is pretty crappy in my experience and can cause a lot more problems than it
solves. However, RAID controllers take a PCI-E slot, and those slots will,
fortunately/unfortunately, all be taken by 4x Gigabyte GTX 980 Ti cards. Is it worth
running RAID with the software controller? Or is it better just to do full copy clone
backups?

Reply

Tim Dettmers says


2015-09-02 at 08:26

I do not think it is worth it. Usually, a common SATA SSD will be fast enough for
most kinds of data; in some cases there will be a decrease in performance
because the data takes too long to load, but compared to the effort and
money spent on a RAID system (hardware) it is just not worth it.

Reply

mxia.mit@gmail.com says
2015-09-01 at 21:18

Hey Tim,

Thank you so much for this great writeup, it’s been pivotal in helping me and my
co-founder understand the hardware. We’re a duo from MIT currently working on a


venture backed startup bringing deep learning to education, hoping to help at least
improve, if not fix, the US education system.

Our first build is aiming to be cheap where it can (since both of us are beginners
and we need to be frugal with our funding) but future proof enough for us to do
harder things.

My current build consists of these parts:

Mobo: Asus X99-E WS SSI CEB LGA2011-3 Motherboard


CPU: Intel Core i7-5820K 3.3GHz 6-Core Processor
Video Card: EVGA GeForce GTX 960 4GB SuperSC ACX 2.0+ Video Card
PSU: EVGA 850W 80+ Gold Certified Fully-Modular ATX Power Supply

RAM: Corsair Vengeance LPX 16GB (2 x 8GB) DDR4-3000


Storage: Sandisk SSD PLUS 240GB 2.5″ Solid State Drive
Case: Corsair Air 540 ATX Mid Tower Case

Could you look over these and offer any critique? My logic was to have a Mobo
and CPU that could handle upgrading to better hardware later, things like the PSU,
Ram, and the 960 I’m willing to replace later on.

Thank you in advance! Also is there a way we could exchange emails and chat
more?

Would love any advice we can get from you while we build out our product.

Best,

Mike Xia
Reply

Tim Dettmers says


2015-09-02 at 08:31

Looks good. The build is a bit more expensive due to the X99 board, but as
you said, that way it will be upgradeable in the future which will be useful to
ensure good speed of preprocessing the ever-growing datasets. You are
welcome to send me an email. My email is firstname.lastname@gmail.com

Reply


Carles Gelada says


2015-08-27 at 17:35

I have been looking for an affordable CPU with 40 lanes without luck. Could you
give me a link?
I am also curious about the actual performance benefit of 16x vs 8x. If the
bottleneck is the DMA writes, will the performance be cut in half?

Reply

Tori says
2015-08-18 at 00:38

Thank you so much for such informative article!

How would GTX Titan Z compare to GTX Titan X for the purpose of training a large
CNN? Do you think it’s worth the money to buy a GTX Titan Z or is a GTX Titan X
good enough? Thanks!

Reply

Tim Dettmers says


2015-08-18 at 05:34

A GTX Titan X will be much better in most cases. If you want more details, have
a look at my answer about this question on Quora.

Reply

Peter says
2015-08-13 at 16:22

Hi Tim,
Firstly, thanks for this article; it’s extremely informative (in fact your entire blog
makes fascinating reading for me, since I’m very new to neural networks in
general).


I want to get a more powerful GPU to replace my old Gtx 560 Ti (a great little card,
but 1gb of memory is really limiting and I presume it’s pretty slow these days too).
Sadly I cannot really afford the GTX Titan X (as much as I’d like to, 1300 CAD is too
damn high). The 980 Ti is also a bit on the high end, so I’m looking at the 980, since
it’s about 200 CAD cheaper. My question is; how much performance am I gaining
from my old 560 Ti to a 980/980 Ti/Titan X? Is the difference in gained speed even
that large? If it’s worth saving for the bigger card then I’ll just have to be patient.
I’m currently running torch7 and a LSTM-RNN with batches of text, not images, but
if I want to do image learning I assume I’d want as much RAM as possible?
Cheers
Reply

Tim Dettmers says


2015-08-16 at 09:39

The speedup should be about 4x when you go from a GTX 560 Ti to a GTX
980. The 4GB of RAM on the GTX 980 might be a bit restrictive for convolutional
networks on large image datasets like ImageNet. A GTX Titan X or GTX 980 Ti
will only be 50% faster than a GTX 980. If you wait about 14-18 months you can
get a new Pascal card which should be at least 12x faster than your GTX 560 Ti.
I personally would value getting additional experience now as more important
than getting less experience now and faster training in the future; in other
words, I would go for the GTX 980.

Reply

Peter says
2015-08-17 at 15:36

How exactly would I be restricted by the 4GB of ram? Would I simply not
be able to create a network with as many parameters, or would there be
other negative effects (compared to the 6GB of the 980 Ti)?
You’ve mentioned in the past that bandwidth is the most important aspect
of the cards, and the 980 Ti has 50% higher bandwidth than the regular
980; would that mean it’s 50% faster too, or are there other factors
involved?

Reply

Tim Dettmers says


2015-08-17 at 16:56


Yes, that's correct: if your convolutional network has too many
parameters it will not fit into your RAM. Other factors besides memory
bandwidth only play a minor role, so indeed, it should be about 50%
better performance (not the 33% I quoted earlier; I edited this for
correctness just now).

Reply

howtobeahacker says
2015-08-13 at 07:41

Hi Tim,
I have a minor question related to the 6-pin and 8-pin power connectors. It is related to
your sentence “One important part to be aware of is if the PCIe connectors of your
PSU are able to support a 8pin+6pin connector with one cable”.
My workstation has one 8-pin cable that splits into TWO 6-pin connectors. Is it possible
to plug these two 6-pin connectors in to power up a Titan X, which requires a 6-pin
and an 8-pin power connector? I think I will try it, because I want to plug in 2 Titan X
GPUs and only this way can my workstation support two GPUs.
Thank you so much.
@An

Reply

Tim Dettmers says


2015-08-13 at 08:13

I think this will depend somewhat on how the PSU is designed, but I think you
should be able to power two GTX Titan X with one double 6-pin cable, because
the design makes it seem that it was intended for just that. Why would they put
two 6-pin connectors on a cable if you cannot use them? I think you can find
better information if you look up your PSU and see if there is a documentation,
specification or something like that.

Reply


howtobeahacker says
2015-08-12 at 05:09

Hi Tim,
Thanks for your responses. I read your posts and I remembered an image of a
piece of software in Ubuntu that visualizes the state of the GPUs, something similar to
the Task Manager for CPUs. If you have any information, please let me know.
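
(A small sketch of such a GPU monitor, assuming the pynvml bindings from the nvidia-ml-py package are installed; the command-line tool nvidia-smi that ships with the driver reports the same information:)

```python
import time
import pynvml

pynvml.nvmlInit()
try:
    while True:
        for i in range(pynvml.nvmlDeviceGetCount()):
            h = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(h)   # .gpu and .memory, in percent
            mem = pynvml.nvmlDeviceGetMemoryInfo(h)          # .used and .total, in bytes
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            print(f"GPU {i}: {util.gpu:3d}% util, "
                  f"{mem.used / 1024**2:6.0f}/{mem.total / 1024**2:6.0f} MB, {temp} C")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```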

Reply

Florijan Stamenković says


2015-08-11 at 12:34

Hi Tim!

We’ve already asked you for some advice, and it was helpful… We put together a
dev-box in the meanwhile, with 4 Titans inside, and it works perfectly.

Now we are considering production servers for image tasks. One of them would be
classification. Considering the differences between training and runtime (runtime
handles a single image, forward prop only), we were wondering if it would be more
cost effective to run multiple weaker GPUs, as opposed to fewer stronger ones.
We are reasoning that a request queue consisting of single-image tasks could be
processed faster on two separate cards, by two separate processes, than on a
single card that is twice as fast. What are your thoughts on this?

We’ve run very crude experiments, comparing classification speed of a single image
on a Titan machine vs. 960M-equipped laptops. The results were more or less as
we expected: Titans are faster, but only about 2x, whereas Titans are 4x more
expensive than a GTX 960 (which has significantly more GFLOPS than the 960M). In
absolute terms, classification speed on a weaker card is acceptable; we’re
wondering about behavior under heavy load.

Reply

Tim Dettmers says


2015-08-11 at 14:01

Hi Florijan!


I think in the end this is a numbers game. Try to overflow a GTX 960M and a
Titan with images and see how fast they go and compare that with how fast
you need to be. Additionally, it might make sense to run the runtime
application on CPUs (might be cheaper and more scalable to run them on AWS
or something) and only run the training on GPUs. I think a smart choice will
take this into account, and how scalable and usable the solution is. Some AWS
CPU spot instances might be a good solution until you see where your project
is headed to (that is if a CPU is fast enough for your application).
Reply

Florijan Stamenković says


2015-08-11 at 14:09

Tim,

Thanks for your reply. You’re right, it definitely is a numbers game; I guess
we will simply need to stress-test.

We already tried to run our classifier on the CPU, but classification time was
an order of magnitude slower than on the 960M, so that doesn’t seem a
good option, especially considering the price of a GTX 960 card.

We’ll do a few more tests at some point. If we find out anything interesting,
I’ll post back here…

Reply

Roelof says
2015-08-09 at 20:42

Hi Tim,
Thanks a lot for your great hardware guide!

I’m planning to build a 3 x Titan X GPU setup, which will be more or less running on
a constant basis: would you say that water cooling will make a big impact on
performance (by keeping the temperatures always below the 80 degrees)?


As the machine will be installed remotely, where I don’t have easy access to it, I’m a
bit nervous about installing a water cooling system in such a setup, with the risk of
coolant leakage, so the “risk” has to be worth the performance gain.
Do you have any experience with water cooled systems, and would you say that it
would be a useful addition?

Also, would you advise a nice tightly fit chassis, or a bigger one which allows better
airflow?

Finally (so many questions :P), do you think 1500 watts with 92-94% efficiency at
100% load should suffice in case I might use 4 Titan X GPUs, or would it be
better to go for a 1600W PSU?
Reply

Tim Dettmers says


2015-08-10 at 04:55

If you operate the computer remotely, another option is to flash the BIOS of
the GPU and crank up the fan to max speed. This will produce a lot of noise
and heat, but your GPUs should run slightly below 80 degrees, or at 80 degrees
with little performance lost.

Water cooling is of course much superior, but if you have little experience with it
it might be better to just go with an air cooled setup. I have heard that, if installed
correctly, water cooling is very reliable, so maybe this would be an option if
somebody else who is familiar with water cooling helps you to set it up.

In my experience, the chassis does not make such a big difference. It is all
about the GPU fans and getting the heat out quickly (which is mostly towards
the back and not through the case). I installed extra fans for better airflow
within the case, but this only makes a difference of 1-2 degrees. What might
help more are extra backplates and small attachable cooling pads for your
memory (both about 2-5 degrees).

I used a 1600W PSU with 4 GTX Titans, which need just as much power as a GTX
Titan X, and it worked fine. I guess 1500W would also work well, and 92-94%
efficiency is really good. I would try the 1500W one and if it does not work
just send it back.

Reply

Roelof says
2015-08-10 at 16:57

Thanks for the detailed response, I’ve decided to go for:



– Chassis: Corsair Carbide Air 540


– Motherboard: ASUS X99-E WS
– Cpu: Intel(Haswell-e) Core i7 5930K
– Ram: 64GB DDR4 Kingston 2133Mhz
– Gpu: 3 x NVIDIA GTX TITAN-X 12GB
– HD1: 2 X 500GB SSD Samsung EVO
– HD2: 3 X 3TB WD Red in RAID 5
– PSU: Corsair AX1500i (1500Watt)

With a custom-built water cooling system for both the CPU and the 3 Titan
Xs, which I hope will let me crank up these babies while keeping the
temperature at all times below 80 degrees.

The machine is partly (at least the chassis is) inspired by NVIDIA’s recently
released DevBox for Deep Learning (https://developer.nvidia.com/devbox),
but for almost 1/2 of the price. Will post some benchmarks with the newer
cuDNN v3 once it’s built and all set up.
Reply

Alex says
2015-11-12 at 01:15

How did your setup turn out? I am also looking to either build a box
or find something else ready-made (if it is appropriate and fits the bill).
I was thinking of scaling down the NVIDIA DevBox as well. I also saw
these http://exxactcorp.com/index.php/solution/solu_detail/233 which
are similar. Very expensive.

Why is there no mention of Main
Gear https://www.maingear.com/custom/desktops/force/index.php anywhere?
Are they no good? The price seems too good to be true. I
have heard that they break down, but I have also heard that the folks
at Main Gear are very responsive and helpful.

Thanks for any insight, and thanks Tim for the great blog posts!

Reply

Axel says


2015-08-08 at 19:17
Hi Tim,
I’m a Caffe user, and since Caffe has recently added support for multiple GPUs, I
have been wondering if I should go with a Titan X or with 2 GTX 980s. Which of these
2 configurations would you choose? I’m more inclined towards the 2 GTX 980s, but
maybe there are some downsides to this configuration that I haven’t thought
about.

Thanks!

Reply

Tim Dettmers says


2015-08-09 at 05:02

This is relevant. I do not have experience with Caffe parallelism, so I cannot


really say how good it is. So 2 GPUs might be a little bit better than I said in the
quora answer linked above.

Reply

howtobeahacker says
2015-08-08 at 08:19

Hi, I intend to plug 2 Titan X GPUs into my workstation. In the spec of my
workstation, it says that it is possible to have up to 2 NVIDIA K20 GPUs. In fact, the K20
and Titan X are the same size. However, now that I have the first Titan X GPU, I can see
that if I plug the second one in, there will only be a tiny space between the 2 GPUs. I wonder if it is
safe for the cooling of the GPU system.
Hope to have your opinion.
Thanks

Reply

Tim Dettmers says


2015-08-08 at 09:15

A very tiny space between GPUs is typical for non-Tesla cards and your cards
should be safe. The only problem is that your GPUs might run slower because
they reach their 80 degrees temperature limit earlier. If you run a Unix system,
flashing a custom BIOS to your Titans will modify the fan regulation so that
your GPUs should be cool (< 80 degrees C) at all times. However, this may
increase the noise and heat inside the room where your system is located.
Flashing a BIOS for better fan regulation will first and foremost increase
the lifetime of your GPUs, but overall everything should be fine and safe
without any modifications even if you operate your cards at maximum
temperature for some days without pause (I personally used the standard
settings for a few years and all my GPUs are still running well).
Reply

howtobeahacker says
2015-08-11 at 01:52

Hi Tim,
Thanks for your responses. I read your posts and I remembered an image of
a piece of software in Ubuntu that visualizes the state of the GPUs, something similar
to the Task Manager for CPUs. If you have any information, please let me know.

Reply

howtobeahacker says
2015-08-31 at 09:01

Hi Tim,
I just found a way to increase the GPU fan speed in Ubuntu using the NVIDIA X Server
Settings. The details are in
http://askubuntu.com/questions/42494/how-can-i-change-the-nvidia-gpu-fan-speed

Reply

Tim Dettmers says


2015-08-31 at 10:48

Indeed, this will work very well if you have only one GPU. I did not
know that there was an application which automatically prepares the
xorg config to include the cooling settings; this is very helpful, thank
you! I will include that in an update in the future.

Reply


An Tran says
2015-12-07 at 06:20

I just found a way to increase the fan speed of multiple GPUs without
flashing. Here is my documentation:
http://antechcvml.blogspot.sg/2015/12/how-to-control-fan-speed-of-multiple.html

Tim Dettmers says


2015-12-07 at 22:07

This looks great! Thank you!

pedropgusmao says
2015-08-05 at 08:25

Hello Tim,

First of all, thanks for always answering my questions, and sorry for coming back
with more.
Do you think a 980 (4GB) is enough for training current neural nets (AlexNet,
OverFeat, VGG), or would it be wise to go for a 980 Ti?

PS: I am a PhD student, so time for me is cheaper than euros.

Thanks again.

Reply

Tim Dettmers says



2015-08-05 at 08:39
4 GB of memory can indeed be quite short sometimes. If time is cheaper than
money, go for a GTX 980Ti, or even better a GTX Titan X.

Reply

gac says
2015-08-05 at 04:19

Hi Tim,

First of all, excellent blog! I’m putting together a gpu workstation for my research
activities and have learned a lot from the information you’ve provided so ….
thanks!!

I have a pretty basic question. So basic I almost feel stupid asking it but here goes

Given your deep learning setup which has 3x GeForce Titan X for computational
tasks, what are your monitors plugged in to?

I would like a very similar setup to yours (except I’ll have two 29″ monitors) and I
was wondering if it’s possible to plug these into the Titan cards and have them
render the display AND run calculations.

Or is it better to just have another, much cheaper, graphics card which is just for
display purposes?

Reply

Tim Dettmers says


2015-08-05 at 05:42

I have my monitors plugged into a single GTX Titan X and I experience no side
effects from that other than a couple of hundred MB of memory that is needed
for the monitors; the performance for CUDA compute should be almost the
same (probably something like 99.5%). So no worries here, just plug them in
where it works for you (on Windows, one monitor would also be an option I
think).


Reply

Vu Pham says
2015-08-04 at 16:14

I’m so sorry the X3 version of Mellanox does not support RDMA but the X4 does

Reply

Vu Pham says
2015-08-04 at 16:07

So, I did some research on deep learning hardware, and I assume the most appropriate parts
list is:

Motherboard: X10DRG-Q – This is a dual-socket board which allows you to double
the lanes of the CPU. It has 4 fully functional x16 PCIe 3.0 slots and an extra x4 PCIe
2.0 slot for a Mellanox card.

CPU: 2x E5-2623

Network card: Mellanox ConnectX-3 EN Network Adapter MCX313A-BCBT

Star of the show: 4x Titan X

Assuming the other parts are $1,000, the total cost would be $7,585, half the price of the
NVIDIA DevBox. My god, NVIDIA.

Reply

Tim Dettmers says


2015-08-04 at 16:44

This sounds like a very good system. I was not aware of the X10DRG-Q
motherboard; usually such mainboards are not available for private customers
— this is a great board!


I do not know the exact topology of the system compared to the NVIDIA DevBox,
but if you have two CPUs this means you will have an additional switch
between the two PCIe networks, and this will be a bottleneck where you have to
transfer GPU memory through CPU buffers. This makes algorithms
complicated and prone to human error, because you need to be careful how you
pass data around in your system; that is, you need to take into account the
whole PCIe topology (on which network and switch the InfiniBand card sits,
on which network the GPU sits, etc.). Cuda-convnet2 has some 8 GPU code for
a similar topology, but I do not think it will work out of the box.

If you can live with more complicated algorithms, then this will be a fine system
for a GPU cluster.
Reply

Vu Pham says
2015-08-05 at 10:05

I got it, so stick to the old plan then, Thank you any way.

Reply

Vu Pham says
2015-08-08 at 15:18

Hi Tim
Fortunately, Supermicro provided me with the X10DRG-Q mobo diagram, and it
would also be a general diagram for other socket 2011 dual-socket mobos which
have 4 or more than 4 PCIe x16 slots. The 2 CPUs are connected by 2 QPI (Intel
QuickPath Interconnect) links. If CPU 1 has 40 lanes, then 32 lanes go to 2 PCIe x16 slots,
4x to the 10 Gigabit LAN, and 4x to a x4 PCIe slot (x8 slot shape, which will be covered if
you install a 3rd graphics card). The 2nd CPU also provides 32 lanes for PCI
Express, of which 8x go to the x8 slot at the top (nearest the CPU socket). Pretty
complicated.
The point of building a perfect 4×16 PCIe 3.0 setup was that I thought the
performance was going to be halved if the bandwidth goes from 16x down to 8x. Do
you have any information on how much the performance differs, say for a single
Titan X, between 16x 3.0 and 16x 2.0?

Reply

Tim Dettmers says


2015-08-08 at 15:43

Yes, that sounds complicated indeed! A 16x 2.0 will be as fast as an 8x
3.0, so the bandwidth is also halved by stepping down to 2.0. I do not
think there exists a single solution which is easy and at the same time
cheap. In the end I think the training time will not be that much slower
if you run 4 GPUs on 8x 3.0, and with that setup you would not run into
any programming problems for parallelism and you will be able to use
standard software like Torch7 with integrated parallelism, so I would
just go for an 8x 3.0 setup.

If you want a less complicated system that is still faster, you can think
about getting a cheap InfiniBand FDR card on eBay. That way you
would buy 6 cheap GPUs and hook them all up via InfiniBand at 8x 3.0.
But probably this will be a bit slower than straight 4x GTX Titan X on 8x
3.0 on a single board.
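
(A rough back-of-the-envelope check, assuming the usual figures of about 0.5 GB/s per lane for PCIe 2.0 and roughly 1 GB/s per lane for PCIe 3.0 after encoding overhead:)

```python
# Approximate usable bandwidth per lane after encoding overhead, in GB/s
PER_LANE_GBPS = {"2.0": 0.5, "3.0": 0.985}

def pcie_bandwidth(generation, lanes):
    """Rough one-directional bandwidth of a PCIe link in GB/s."""
    return PER_LANE_GBPS[generation] * lanes

print("16x PCIe 2.0:", pcie_bandwidth("2.0", 16), "GB/s")   # ~8.0 GB/s
print(" 8x PCIe 3.0:", pcie_bandwidth("3.0", 8), "GB/s")    # ~7.9 GB/s
print("16x PCIe 3.0:", pcie_bandwidth("3.0", 16), "GB/s")   # ~15.8 GB/s
```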

Reply

Mohamad Ivan Fanany says


2015-07-31 at 09:03

Hi Tim, very nice sharing. I just would like to comment on the ‘silly’ parts (smile): the
monitors. Since I only have one monitor, I just use NoMachine and put the screen in
one of my virtual workspaces in ubuntu to switch between the current machine and
our deep learning servers. Surprisingly this is more convenient and energy efficient
both for the electricity and our neck movement. Just hope this would help
especially those who only have single monitor. Cheers.

Reply

Tim Dettmers says


2015-07-31 at 09:15

Thanks for sharing your working procedure with one monitor. Because I got a
second monitor early, I kind of never optimized the workflow on a single
monitor. I guess when you do it well, as you do, one monitor is not so bad
overall — and it is also much cheaper!


Reply

Xardoz says
2015-07-30 at 09:16

Very Useful information indeed, Tim.

I have a newbie question: If the motherboard has integrated graphics facility, and if
the GPU is to be dedicated to just deep learning, should the display monitor be
connected directly to the motherboard rather than the GPU?

I have just bought a machine with a GeForce Titan X card and they just sent me an
e-mail saying:

“You have ordered a graphics card with your computer and your motherboard
comes supplied with integrated graphics. When connecting your monitor it is
important that you connect your monitor cable to the output on the graphics card
and NOT the output on the motherboard, because by doing so your monitor will
not display anything on the screen.”

Intuitively, it seems that off-loading the display duties to the motherboard will free
the GPU to do more important things. Is this correct? If so, do you think that this
can be done simply? I would ask the supplier, but they sounded lost when I started
talking about deep learning on Graphics cards.

Regards

Xardoz

Reply

Tim Dettmers says


2015-07-30 at 10:41

Hi Xardoz! You will be fine when you connect your monitor to your GPU,
especially if you are using a GTX Titan X. The only significant downside of this is
some additional memory consumption, which can be a couple of hundred MB. I
have 3 monitors connected to my GPU(s) and it never bothered me doing
deep learning. Only if you train very large convolutional nets that are on the edge of
the 12GB limit would I think about using the integrated graphics.

Reply

Xardoz says
2015-08-05 at 08:24

Thanks Tim.

It seems that my motherboard graphics capability (Asus Z97-P with an Intel
i7-4790K) is not available if a graphics card is installed.

And yes, I do need more than 12GB for training a massive NN! So I decided
to buy a small graphics card just to run the display, as suggested in one of
your comments above. Seems to work fine.

Regards

Reply

Charles Foell III says


2015-07-24 at 16:06

Hi Tim,

1) Great post.

2) Do you know how motherboards with dedicated PCI-E lane controllers shuffle
data between GPUs with deep learning software? For example, the PLX PEX 8747
purports control of 48 PCI-E lanes beyond the 40 lanes a top-shelf CPU controls,
e.g. allowing five x16 connections, but it’s not clear to me if deep learning software
makes use of such dedicated PCI-E lane controllers.

I ask since going beyond three x16 connections with only CPU control of PCI-E lanes
requires dual CPUs, but such boards along with suitable CPUs can in sum be
thousands of dollars more expensive than a single-CPU motherboard that has a PLX
PEX 8747 chip. If the latter has as good performance for deep learning software,
might as well save the money!

Thanks!

-Charles

Reply

Tim Dettmers says


2015-07-24 at 16:21

That is very difficult to say. I think the PLX PEX 8747 chip will be handled by the
operating system after you installed some driver, so that deep learning software
would use it automatically in the background. However, it is unclear to me if
you really can operate three GPUs in 16/16/16 when you use this chip, or if it will
support peer-to-peer GPU memory transfers. I think you will need to get in
touch with the manufacturer for that.

Reply

Charles Foell III says


2015-07-25 at 00:12

Hi Tim, makes sense. Thanks for the reply.

I’ll need to dig more. I’ve seen various GPU-to-GPU benchmarks for server-
grade motherboards (e.g. in HPC systems), including a raw ~ 7 GB/s using
a PLX PEX chip (lower than host-to-GPU), but I’ve had difficulty finding
benchmarks for single-CPU boards, let alone for more than three x16 GPU
connections.

If you come across a success story of a consumer-grade single-CPU system
with exceptional transfer speed (better than 40 PCI-E 3.0 lanes worth in
sum) between GPUs when running common deep learning
software/libraries, or even a system with such benchmarks for raw CUDA
functions, please update.

In the meantime, I look forward to your other posts!

Best,
Charles

Reply

Nikos Tsarmpopoulos says


2017-05-14 at 18:58

AMD’s Naples CPU is expected to provide 128 lanes: 64 lanes for 4 PCIe
expansion cards at x16 and the remaining lanes for the CPU-to-CPU interconnect
(called Infinity Fabric).


Source:
https://arstechnica.co.uk/information-technology/2017/03/amd-
naples-zen-server-chip-details/
Reply

Nikos Tsarmpopoulos says


2017-05-14 at 19:08

In another article, it is implied that with 1xCPU systems, 128 lanes will
be available for I/O, presumably allowing for full x16 lanes on up to 8
GPUs, or for use with NVLink bridges.

Source:
http://www.anandtech.com/show/11183/amd-prepares-32-core-naples-
cpus-for-1p-and-2p-servers-coming-in-q2

Reply

Jon says
2015-07-16 at 17:09

Will ECC RAM make convolutional NNs or deep learning more efficient or better? In
other words, if the same money can buy me one PC with ECC RAM vs. TWO PCs
without ECC RAM, which should I pick for deep learning?

Reply

Tim Dettmers says


2015-07-16 at 17:47

I think ECC memory only applies to 64-bit operations and thus would not be
relevant to deep learning, but I might be wrong.

ECC corrects if a bit is flipped in the wrong way due to physical inconsistencies
at the hardware level of the system. Deep learning was shown to be quite
robust to inaccuracies, for example you can train a neural network with 8-bits (if

you do it carefully and in the right way); training a neural network with 16-bit
works flawlessly. Note that training on 8 bit for example, will decrease the
accuracy for all data while ECC is relevant only for some small parts of the data.
However, a flipped bit might be quite severe while a conversion from 32 to 8-
bits might still be quite close to the real value. But overall I think an error in a
single bit should not be so detrimental to performance, because the other
values might counterbalance this error or in the end softmax will buffer this (an
extremely large error value sent to half the connections might spread to the
whole network, but in the end for that sample the probability for the softmax
will be just 1/classes for each class).

Remember that there are always a lot of samples in a batch, and that the error
gradients in this batch are averaged. Thus even large errors will dissipate
quickly, not harming performance.
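
(To get a feeling for how small low-precision rounding errors are in practice, here is a minimal NumPy sketch with made-up gradient values; the 8-bit scheme below is a crude linear quantization, not the careful approach mentioned above:)

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.standard_normal((256, 1024)).astype(np.float32)  # fake gradients for a batch

# Relative error from storing the values as 16-bit floats
f16 = grads.astype(np.float16).astype(np.float32)
print("mean relative error, float16:",
      np.abs(f16 - grads).mean() / np.abs(grads).mean())

# Crude linear 8-bit quantization of the same values
scale = np.abs(grads).max() / 127.0
q8 = np.round(grads / scale).astype(np.int8).astype(np.float32) * scale
print("mean relative error, 8-bit:",
      np.abs(q8 - grads).mean() / np.abs(grads).mean())

# A single corrupted value (a stand-in for a flipped bit) is divided by the
# batch size once the gradients of the batch are averaged
corrupted = grads.copy()
corrupted[0, 0] += 1e6
print("error left in the averaged gradient:",
      np.abs(corrupted.mean(axis=0) - grads.mean(axis=0)).max())  # 1e6 / 256
```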
Reply

Tran Lam An says


2015-07-16 at 04:14

Hi Tim,
Thanks for your support of the Deep Learning group.
I have a workstation DELL T7610 http://www.dell.com/sg/business/p/precision-
t7610-workstation/pd.
I want to plug in 2 Titan X from NVIDIA and ASUS. Everything seems okay, I just
wonder about PSU, cooling, and dimensions of GPU.
I will check the cooling and dimensions later. My main concern is about power.

I looked at the documents
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-x/specifications
and https://www.asus.com/Graphics-Cards/GTXTITANX12GD5/specifications/.
Both of them require up to 300W of power.
However, in the specs of the workstation, they say the following about graphics cards:
Support for up to three PCI Express® x16 Gen 2 or Gen 3 cards up to 675W (total
for graphics (some restrictions apply))
GPU: One or two NVIDIA Tesla® K20C GPGPU – Supports Nvidia Maximus™
technology.
So the total power seems okay, right?

Another piece of evidence:
The power of the workstation would be:
Power Supply: 1300W (externally accessible, toolless, 80 Plus® Gold Certified, 90%
efficient)

CPU (230W) + 2 GPUs (2 × 300W) + 300W = 1130W.


It seems okay for the power, right?

Hope to have your opinions.


Thank you for your sharing.
Reply

Tim Dettmers says


2015-07-16 at 05:55

Everything looks fine. I ran 3 GTX Titan with a 1400 watt PSU and 4 GTX Titan
with 1600 watt, so you should definitely be fine with 1300 watt and 2 GPUs. A
GTX Titan also uses more power than a GTX Titan X. Your calculation looks
good and there might even be space for a third GPU.
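
(For a rough sanity check of PSU sizing, a few lines that add up component TDPs plus some headroom; the wattages and the 10% margin are my own rough assumptions, not exact figures:)

```python
def required_psu_watts(gpu_watts, num_gpus, cpu_watts, other_watts=150, margin=1.1):
    """Very rough PSU estimate: component TDPs plus headroom for power spikes."""
    return (gpu_watts * num_gpus + cpu_watts + other_watts) * margin

# Example: four 250W GPUs (GTX Titan X class) and a 140W CPU
print(round(required_psu_watts(gpu_watts=250, num_gpus=4, cpu_watts=140)), "W")  # ~1419 W
```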

P.S. The comments are locked in place to await approval if someone new posts
on this website. This is so to prevent spam.

Reply

Haider says
2015-07-07 at 01:38

Tim,

I am new to deep NNs. I discovered their tremendous progress after seeing the
excellent 2015 GTC NVIDIA talk. Deep NNs will be very useful for my PhD, which is
about electrical brain signal classification (Brain Computer Interface).

What a joy to find your blog! I just wish you wrote more.
All your posts are full of interesting ideas. I have checked the comments of the posts,
which are no less interesting than the posts themselves and full of important hints
too.

I read a lot, but did not find most of your interesting hints on hardware elsewhere.
Your posts were just brilliant. I believe your posts filled a gap in the web, especially
on the performance and hardware side of deep NNs.

I think that on the hardware side, after reading your posts, I have enough knowledge to
build a good system.

On the software side, I found a lot of resources. However, I am still a bit confused.
Perhaps because they weren't your posts. Why do you only write on hardware?
You can write very well, and we would love to hear about your experience with software
too.

Where should I begin?

I’m very fond of Matlab and haven’t programmed much in other languages. And I don’t
know anything about Python, which seems very important to learn for machine
learning. I don’t mind learning Python if you advise me to do so. But if it is not
necessary, then maybe I can spare my time for other deep NN stuff, which is
overwhelming already. My excitement has crippled me. I have opened ~600 tabs and
want to see them all.

If you were in my shoes, what platform would you begin to learn with? Caffe, Torch or
Theano? Why?

And please, tell me too about your personal preference. I learned from your posts
that you are writing your own programs. But in case you were picking one of these
for yourself, what would it be? And in case you were like me, with no Python experience,
what would you pick?

I am very interested to hear your opinion. I am not in a hurry. When you feel like
writing, please answer me with some details.

I thank you sincerely for all the posts and comment replies in your blog and am eager
to see more posts from you, Tim!
Thank you!
Reply

Tim Dettmers says


2015-07-08 at 07:14

Thank you for all this praise — this is encouraging! I wrote about hardware
mainly because I myself focused on the acceleration of deep learning and
understanding the hardware was key in this area to achieve good results.
Because I could not find the knowledge that I acquired elsewhere on a single
website, I decided to write a few blog posts about this. I plan to write more
about other deep learning topics in the future.

In my next posts I will compare deep learning to the human brain: I think this
topic is very important because the relationship between deep learning and
the brain is in general poorly understood.

I also wanted to make a blog post about software, but I did not have the time
yet to do that — I will do so probably in this month or the next.

Regarding your questions, I would really recommend Torch7, as it is the deep
learning library which has the most features and which is steadily extended by
Facebook and DeepMind with new deep learning models from their research
labs. However, as you posted above, it is better for you to work on Windows,
and Torch7 does not work well on Windows. Theano is the best option here I
guess, but Minerva also seems to be okay.

Caffe is a good library when you do not want to fiddle around too much within a
certain programming language and just want to train deep learning models;
the downside is that it is difficult to make changes to the code and the training
procedure/algorithm and few models are supported.

In the case of brain signals per se, I think Python offers a lot of packages which
might be helpful for your research.

However, if you just want to get started quickly with the language you know,
Matlab, then you can also use the neural network bindings from the Oxford
research group, with which you can use your GPU to train neural networks
within Matlab.

Hope this helps, good luck!


Reply

Zizhao says
2015-06-25 at 13:15

Do you think that if you have too many monitors, they will occupy too many resources of
your GPU? If yes, how do you solve this issue?

Reply

Tim Dettmers says


2015-06-26 at 07:48

I have three monitors with 1920×1080 resolution and the monitors draw about
400 MB. For me I never had any issues with this, but I also had 6GB cards and I
did not train models that maxed out my GPU RAM. If you have a GPU with less
memory (GTX 980 or GTX 970) then there might be some problems for
convolutional nets. The best way to circumvent this problem is to buy a really
cheap GPU for the monitors (a GT210 costs about $30 and can power two
(three?) monitors), so that your main deep learning GPU is not attached to any
monitor.

Reply


Sameh Sarhan says


2015-07-06 at 20:04

Tim, you have a wonderful blog and I am very impressed with the
knowledge as well as the effort that you are putting into it.
I run a Silicon Valley startup that works in the space of wearable bio-sensing.
We developed unique non-invasive sensors that can
measure vitals and psychological and physiological effects. Most of our signals
are multivariate time series; we typically process (1×3000) per sensor per
reading, and we can typically use up to 5 sensors.
We are currently expanding our ML algorithms to add CNN capabilities, and I
wonder what you recommend in terms of GPU.
Also, I would highly appreciate it if you could email me to further discuss a
potentially mutually beneficial collaboration.
Regards,
Sameh

Reply

Tim Dettmers says


2015-07-08 at 07:24

Hi Sameh! If you have multivariate time series, a common CNN
approach is to use a sliding window over your data of X time steps.
Your convolutional net would then use temporal instead of spatio-
temporal convolution, which would use much less memory. As such,
6GB of memory should probably be sufficient for such data and I
would recommend a GTX 980 Ti or a GTX Titan. If you need to run
your algorithms on very large sliding windows (an important signal
happened 120 time steps ago, to which the algorithm should be
sensitive), a recurrent neural network would be best, for which 6GB of
memory would also be sufficient. If you want to use CNNs with such
large windows it might be better to get a GTX Titan X with 12GB of
memory.
Regards,
Tim
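
(To illustrate the sliding-window framing, a minimal NumPy sketch; the window length, stride, and shapes are invented, and sliding_window_view needs NumPy 1.20 or newer:)

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# 5 sensors, 3000 time steps per reading (as in the comment above)
signal = np.random.randn(5, 3000).astype(np.float32)

window = 128   # time steps covered by one training example
stride = 32    # hop between consecutive windows

# Slide a window over the time axis: shape (5, num_windows, window)
windows = sliding_window_view(signal, window, axis=1)[:, ::stride, :]

# Rearrange to (num_windows, channels, window), the usual layout for a
# temporal (1D) convolutional net
batch = windows.transpose(1, 0, 2)
print(batch.shape)   # (90, 5, 128)
```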

Reply


Sergii says
2015-06-19 at 12:11

Sorry that my question was confusing.


I wrote simple code which runs axpy cuBLAS kernels and memcpy. As you can see
from the profiler ![](http://imgur.com/dmEOZTY,q2HhqlX#1), in the case of pinned
memory the kernels that were launched after cudaMemcpyAsync run in parallel (with
respect to the transfer process).

However, in the case of pageable memory, ![](http://imgur.com/dmEOZTY,q2HhqlX)
cudaMemcpyAsync blocks the host, and I can’t launch the next kernel.

In the chapter `Direct memory access (DMA)` you say “…on the third step the
reserved buffer is transferred to your GPU RAM without any help of the CPU…”, so
why does cudaMemcpyAsync block the host until the end of the copy process? What
is the reason for that?

Reply

Tim Dettmers says


2015-06-19 at 13:07

The most low-level reason I can think of is, as I said above, that pageable
memory is inherently insecure and may be swapped/pushed around at will. If
you start a transfer and want to make sure that everything works, it is best to
wait until the data is fully received. I do not know about the low level details
how the OS and its drivers and routines (like DMA) interact with the GPU. If you
want to know these details, I think it would be best to consult with people from
NVIDIA directly, I am sure they can give you a technical accurate answer; you
might also want to try the developer forums.

Reply

Sergii says
2015-06-18 at 13:36

Thank you for the reply.


Do you know what is the reason for the inability to have overlapping pageable host
memory transfer and kernel execution?


Reply

Tim Dettmers says


2015-06-18 at 13:42

It has all to do with having a valid pointer to the data. If your memory is not
pinned, then the OS can push the memory around freely to make some
optimizations, so you are not certain to have a pointer to CPU memory and
thus such transfers are not allowed by the NVIDIA software because they easily
run into undefined behaviour. With pinned memory, the memory no longer is
able to move, and so a pointer to the memory will stay the same at all times, so
that a reliable transfer can be ensured.

This is different in GPUs, because GPU pointers are designed to be reliable at all
times as long as they stay on some GPU memory, so these problems do not
exist for GPU -> GPU transfers.

Reply

Sergii says
2015-06-18 at 14:11

Thanks for the wonderful explanation. But I still have a question. Your
previous reply can explain why data transfer with pageable memory can’t
be asynchronous with respect to a host thread, but I still do not understand
why a device can’t execute kernel while copying data from a host. What is
the reason for that?

Reply

Tim Dettmers says


2015-06-18 at 14:57

Kernels can execute concurrently, the kernel just needs to work on a
different data stream. In general, each GPU can have 1 host-to-GPU
and 1 GPU-to-GPU transfer active, and execute a kernel concurrently on
unrelated data in another stream (by default all operations use a
default stream and are thus not concurrent).

But you are right that you cannot execute a kernel and a data transfer
in the same stream. I assume there are issues with the hardware not
being able to resume a kernel once the end of a stream that is being
transferred at that very moment is reached (the kernel would need to
wait, then compute, then wait, then compute, then wait… this will
not deliver good performance!). So it will be because of this that you
cannot run a kernel on partial data.
Reply

Sergii says
2015-06-18 at 12:08

Hi Tim!
Thanks for your helpful and detailed write-up.

It seems from this blog post
(http://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/)
that for concurrent kernel execution and data transfer the memory must be pinned.

You wrote “…one might be able to prevent that when one uses pinned memory, but
as you shall see later, it does not matter anyway…” and AFAIU you don’t use pinned
memory in the async batch allocation process (`clusternet` project).

Also, pinned memory is mentioned in the documentation
(http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#asynchronous-transfers-and-overlapping-transfers-with-computation),
but at the same time this
(http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79)
document says “…the copy may overlap with operations in other streams…” with no
mention of pinned memory.

These contradictory facts are a bit confusing to me. So my questions are:

– Do you have code that can confirm overlap between the transfer of
pageable host memory to device memory and kernel execution?
– And what is actually going on with cudaMemcpyAsync?

Reply

Tim Dettmers says


2015-06-18 at 12:30


What you write is all true, but you have to look at it in two different ways, (1)
CPU -> GPU, and (2) GPU -> GPU.

For CPU -> GPU you will need pinned memory to do asynchronous copies;
however, for GPU -> GPU the copy will be automatically asynchronous in most
use cases — no pinned GPU memory needed (cudaMemcpy and
cudaMemcpyAsync are almost always the same for GPU -> GPU transfers).

It turns out that I use pinned memory in my clusterNet project, but it is a bit
hidden in the source code: I use it only for batch buffers in my BatchAllocator
class, which has an embarrassingly poor design. There I transfer the usual CPU
memory to a pinned buffer (while the GPU is busy) and then transfer it in
another step asynchronously to the GPU, so that the batch is ready when the
GPU needs it.

You can also allocate the whole data set as pinned memory, but this might
cause some problems, because once pinned, the OS cannot “optimize” the
locked-in memory anymore, which may lead to performance problems if one
allocates too large a chunk of memory.
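
(The same idea in a modern framework, as a minimal sketch assuming PyTorch and a CUDA GPU are available; this is an illustration, not the clusterNet code itself: pin the host buffer, then issue the copy with non_blocking=True on a side stream so it can overlap with compute on the default stream.)

```python
import torch

assert torch.cuda.is_available()

host_batch = torch.randn(128, 3, 224, 224)     # pageable host memory
pinned_batch = host_batch.pin_memory()         # page-locked copy of the batch

copy_stream = torch.cuda.Stream()

# Some unrelated work that runs on the default stream
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

with torch.cuda.stream(copy_stream):
    # Asynchronous only because the source is pinned; from pageable memory
    # the copy would block the host thread
    device_batch = pinned_batch.to("cuda", non_blocking=True)

c = a @ b                      # can overlap with the copy on the side stream
copy_stream.synchronize()      # make sure the batch has arrived before using it
print(device_batch.shape, c.shape)
```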

Reply

Kai says
2015-06-15 at 11:21

Hey Tim! Thanks for these posts, they’re highly, highly appreciated! I’m just starting
to get my feet wet in deep learning – is there any way to hook up my Laptop to a
GPU (maybe even an external one?) without having to build a PC from scratch so I
could start GPGPU programming on small datasets with less of an investment?
Does the answer depend on my motherboard?

Reply

Tim Dettmers says


2015-06-15 at 11:49

In that case it will be best to use AWS GPU spot instances, which are cheap and
fast. External GPUs are available, but they are not an option because the data
transfer, CPU -> USB-like interface -> GPU, is too slow for deep learning. Once
you have gained some experience with AWS I would then buy a dedicated
deep learning PC.


Reply

Kai says
2015-06-15 at 14:32

Thanks, that sounds like a good idea then!

Reply

Frank Kaufmann (@FrankKaufmann76) says


2015-06-09 at 17:52

What are your thoughts on the GTX 980 Ti vs. the Titan X? I guess with “980” in
your article you referred to the 4 GB models. The 980 Ti has the same Memory
Bandwidth as the Titan X, 2GB more memory than a 980 (which should make it
better for big convnets), only a few CUDA cores less. And the price difference is 549
USD for a 980 Ti vs 999 USD for the Titan X.

Reply

Tim Dettmers says


2015-06-15 at 11:52

The GTX 980 Ti is a great card and might be the most cost effective card for
convolutional nets right now. The 6GB RAM on the card should be good
enough for most convolutional architectures. If you will be working on video
classification or want to use memory-expensive data sets I would still
recommend a Titan X over a 980 Ti.

Reply

Sinan says
2015-05-29 at 04:11


That sounds interesting. Would you mind sharing more details about your G3258-
based system?

Reply

Tim Dettmers says


2015-05-29 at 05:17

I do not have a Haswell G3258 and I would not recommend one, as it only runs
16 PCIe 3.0 lanes instead of the typical 40. So if you are looking for a CPU I
would not pick Haswell — too new and thus too expensive, and many Haswells
do not have full 40 PCIe lanes.

Reply

Sinan says
2015-05-29 at 05:31

Sorry Tim, my comment was meant to be in response to the comment
#128 by user “lU” from March 9, 2015 at 10:59 PM. I wonder why it didn’t
appear under that one despite having double-checked before posting. I
guess it’s the fault of my mobile browser.

First of all, thank you for a series of very informative posts, they are all
much appreciated.

I was planning to go for a single GPU system (GTX 980 or the upcoming
980 Ti) to get started with deep learning, and I had the impression that at
$72, this is the most affordable CPU out there.

Reply

Tim Dettmers says


2015-05-29 at 05:43

You’re welcome! I was looking for other options, but to my surprise
there were not any in that price range. If you are using only a single
GPU and you are looking for the cheapest option, this is indeed the
best choice.


Reply

Mark says
2015-05-26 at 12:26

That second 10x speed-up claim with NVLink is a bit strange because it is not clear how it
is being made.

Reply

Richard says
2015-05-16 at 16:53

Hi Tim,

First can I say thanks very much for writing this article – it has been very
informative.

I’m a first year PhD student. My research is concerned with video classification and
I’m looking into using convolutional nets for this purpose.

My current system has a GT 620, which takes about 4 hours to run a LeNet-5 based
network built using Theano on MNIST. So I’m looking to upgrade and I have about
£1000 to do it with.

I’ve allocated about £500 for the GPU but I’m struggling to decide what to get. I’ve
discounted the GTX 970 due to the memory problems. I was thinking either a GTX 780
(6GB ASUS version), a GTX 980 or two GTX 960s. What is your opinion on this? I know I
can’t use multiple GPUs with Theano, but I could run two different nets at the same
time on the 960s; however, would it be quicker just to run each net consecutively
on the 980 since it’s faster? Also there’s the 780, which although it would be slower
than the 980 has more RAM, which would be beneficial for convolutional nets. I
looked into buying second hand as you suggested; however, I’m buying through my
university so that isn’t an option.

Thanks for your help and for the great article once again.


Cheers,
Richard
Reply

Tim Dettmers says


2015-05-17 at 15:39

That is really a tricky issue, Richard. If you use convolution over the spatial
dimensions of an image as well as the time dimension, you will have 5-
dimensional tensors (batch size, rows, columns, maps, time) and such tensors
will use a lot of memory. So you really want a card with a lot of memory. If you use
the Nervana Systems 16-bit kernels you would be able to reduce memory
consumption by half; these kernels are also nearly twice as fast (for dense
connections they are more than twice as fast). To use the Nervana Systems
kernels, you will need a Maxwell GPU (GTX Titan X, GTX 960, GTX 970, GTX
980). So if you use this library a GTX 980 will have “virtually” 8GB of memory,
while the GTX 780 has 6GB. The GTX 980 is also much faster than the GTX 780,
which further adds to the GTX 980 option. However, the Nervana Systems
kernels still lack some support for natural language processing, and overall you
will have far more thorough software support if you use Torch and a GTX 780. If you
think about adding your own CUDA kernels, the Nervana Systems + GTX 980
option may not be so suitable, because you will probably need to handle the
custom compiler and program 16-bit floating point kernels (I have not looked
at this, but I believe there will be things which make it more complicated than
regular CUDA programming).

I think both, GTX 780 and GTX 980 are good options. The final choice is up to
you!

Hope this helps!

Cheers,
Tim
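
(To see why 16-bit storage effectively doubles what fits on the card, a rough calculation for a single 5-dimensional activation tensor of the kind mentioned above; the shape is invented for illustration:)

```python
import numpy as np

# (batch, time, maps, rows, cols) for a spatio-temporal convolution layer
shape = (32, 16, 64, 112, 112)
elements = np.prod(shape)

for dtype in (np.float32, np.float16):
    gib = elements * np.dtype(dtype).itemsize / 1024**3
    print(f"{np.dtype(dtype).name}: {gib:.2f} GiB")   # 1.53 GiB vs 0.77 GiB
```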

Reply

Richard says
2015-05-20 at 11:50

Thanks for the detailed response Tim,

Think I’ll go with the 780 for now due to the extra physical memory. Quick
follow-up question: if I have the money for an additional card in the future,
would I need to buy the same model? Could I for example have both a GTX
780 and a GTX 980 running in the same machine so that I can have two
different models running on each card simultaneously? Would there be any
issues with drivers etc.? Going to order the parts for my new system
tomorrow; will post some benchmarks soon.

Cheers,
Richard
Reply

Tim Dettmers says


2015-05-20 at 14:01

GPUs can only communicate directly if they are based on the same
chip (but brands may differ). So for parallelism you would need to get
another GTX 780, otherwise a GTX 980 is fine for everything else. Also
remember, that new Pascal GPUs will hit around Q3 2016 and those will
be significantly faster than any Maxwell GPU (3D memory) — so
waiting might be an option as well.

Reply

Mark says
2015-05-26 at 12:21

FYI on Pascal chip from NVIDIA. Speed up over Titan is “up to 5x.” Of this, a
2x speed up will come from the option of switching to using 16 bit floating
point in Pascal.

The rest of the "up to 10x" speed-up comes from the 2x speed-up you get from NVLink. Here the comparison is two Pascals versus two Titans. I don't know what the speed-up would be if the Pascals used the same PCIe interconnect as the Titans, or if they could even use the PCIe interconnect. Hopefully they can; then a new motherboard would not be necessary.

Reply

Thomas says

2015-05-12 at 23:46
Hi Tim,
Thank you for all your advice on how to build a machine for DL!
You don't talk about the possibility of using a GPU embedded in the motherboard/CPU (or a "small" second GPU) so as to dedicate the "big" GPU to computation. Could that affect the performance in any way?

Also, we want to build a computer to reproduce and improve (by making a more complex model) the work of DeepMind on their generalist AI.
We were thinking about getting one Titan X with 32GB of RAM.
Would you have any specific recommendation concerning the motherboard and CPU?

Thank you very much

Reply

Tim Dettmers says


2015-05-13 at 14:45

There are some GPUs which are integrated (embedded) in regular CPUs and
you can run your monitors on these processors. The effect of this is some saved GPU memory (about a hundred MB for each monitor), while the freed-up computational resources are negligible (less than 1% for 3 monitors). So if you are really
short on memory (say you have a GPU with 2 or 3GB and 3 monitors) then this
might make good sense. Otherwise, it is not very important and buying a CPU
with integrated graphics should not be a deciding factor when you buy a CPU.

As I said in the article, you have a wide variety of options for the CPU and
motherboard, especially if you will stick with one GPU. In this case you can
really go for very cheap components and it will not hurt your performance
much. So I would go for the cheapest CPU and motherboard with a reasonably good rating on pcpartpicker.com if I were you.

Reply

Thomas says
2015-05-13 at 16:53

Thank you very much!


Reply

Florijan Stamenković says


2015-05-11 at 13:49

Tim,

Thanks for the excellent guide! It has helped us a lot. However, a few questions
remain…

We plan to build a deep-learning machine (in a server rack) based on 4 Titan cards.
We need to select other hardware. Ideally we would put all four cards on a single
board with 4x PCIe 3.0 x16. The questions are:

1. If I understand correctly, GPU intercommunication is the bottleneck. Should we go


for dual 40-lane CPUs (Xeons only, right?), or take a single i7 and connect the cards
with SLI?
2. Will any 4x PCIe 3.0 x16 motherboard do? Is socket 2011 preferable?

We plan to use these nets for both convolutional and dense learning. Our budget
(everything except the Titans) is around $3000, preferably less, or a bit more if
justified. Please advise!

Reply

Florijan Stamenković says


2015-05-11 at 14:20

I just read the above post as well and got some needed information, sorry for
spamming. From what I understand, SLI is not beneficial.

Should we then go for two weaker Xeons (2620), each with 40 PCIe lanes? Will
this be cost-optimal?

Thanks,
F

Reply


Tim Dettmers says


2015-05-11 at 14:38

2 CPUs will typically yield no speedup because usually the PCIe networks of
each CPU (2 GPUs for each CPU) are disconnected which means that the GPU
pairs will communicate through CPU memory (max speed about 4 GB/s,
because a GPU pair will share the same connection to the CPU on a PCIe-
switch). While it is reasonable for 8 GPUs, I would not recommend 2 CPUs for a
4 GPU setup.

There are motherboards that work differently, but these are special solutions
which often only come in a package of a whole 8 GPU server rack ($35k-$40k).

If you use a single CPU, then any motherboard with enough slots which supports 4 GPUs will do; choose the CPU so that it supports 40 PCIe lanes and you will be ready to go. Socket 2011 has no advantage over other sockets which fulfill these requirements.

Regarding SLI: SLI can be used for gaming, but not for CUDA (it would be too
slow anyways); so communication is really all done by PCI Express.

Hope this helps!

Reply

Florijan says
2015-05-12 at 11:20

It does, thanks! We are still deciding between a single CPU vs dual-CPU


(for other computing purposes). Could you comment on the following two
motherboards being suitable for our 4 titans:
http://www.asus.com/us/Commercial_Servers_Workstations/X99E_WS/overview/
http://www.asus.com/Commercial_Servers_Workstations/Z10PED8_WS/overview/

In particular the Z10PED8 states it supports “4 x PCIe 3.0/2.0 x16 (dual x16
or quad x8)”, from which I understand it does NOT support quad x16.
Would the X99 be the best solution then?

Reply

Tim Dettmers says



2015-05-12 at 11:58
It is quite difficult to say which one is better, because I do not know the
PCIe switch layout of the dual CPU motherboard. The most common
PCIe switch layout is explained in this article and if the dual CPU
motherboard that you linked behaves in a similar way, then for deep
learning 2 CPUs will be definitely be slower than 1 CPU if you want to
use parallel algorithms across all 4 GPUs; in that case the 1 CPU board
will be better. However, this might be quite different for other
computing purposes than deep learning and a 2 CPU board might be
better for those tasks.

Reply

sacherus says
2015-05-05 at 22:44

Hi Tim,
thank you for your great article. I think it covers everything you need to know to start your journey with DL.

I'm also a grad student (but instead of image processing, I'm in speech processing) and want to buy a machine (I'm also thinking about Kaggle; for a start I could probably take a 20-40 place). I want to buy (in Eastern Europe) a used workstation (without graphics) plus a used graphics card. Probably I will end up with 2 cards in my computer… maybe 3….

Questions:
1) You wrote that you need a motherboard with 7 PCIe 3.0 slots for 3 GPUs. Isn't it possible to have a 16x | 1x | 16x | 1x (etc.) setup? Like in http://www.msi.com/product/mb/Z87-G45-GAMING.html#hero-overview?
2) So there are no setups that support 16x/16x (or are they too expensive)?
3) I see that compute capability also matters. I can buy a GeForce 780 Ti for a similar price to a GTX 970. The 780 Ti has better bandwidth + more GFLOPS (you never mentioned FLOPS), but the 970 has a newer compute capability + more memory.
4) Maybe I should let it go and just buy a 960 or 680 (to just start)… However, the 970 is not much more expensive than those two. Or should I just buy a whole used PC?

Tim, what do you think?

Reply


Tim Dettmers says


2015-05-06 at 05:50

1. You are right, a 16x | 1x | 16x | 1x setup will work just as well; I did not think about it that way, and I will update my blog with that soon — thanks!
2. I hope I understand you right: You have a total of 40 PCIe lanes supported by your CPU (not the physical slots, but rather the communication wires that are laid from the PCIe slots to the CPU) and each GPU will use up to 16 of them (on standard mainboards); so 16x/16x is standard if you use 2 GPUs, for 3 GPUs this is 16x/8x/16x and for 4 GPUs 16x/8x/8x/8x. If you mean physical slots, then a 16x | Yx | 16x setup will do, where Y is any size; because most GPUs have a width of two PCIe slots you most often cannot run 2 GPUs in adjacent 16x | 16x mainboard slots, though this sometimes works if you use water cooling (which reduces the width to one slot).
3. GFLOPS do not matter in deep learning (they are virtually the same for all algorithms); your algorithms will always be limited by bandwidth. The 780 Ti has higher bandwidth but an inferior architecture, and the GTX 970 would be faster. However, the GTX 780 Ti has no glitches, and so I would go with the GTX 780 Ti.
4. The GTX 680 might be a bit more interesting than the GTX 780 Ti if you really want to train a lot of convolutional nets; otherwise a GTX 780 Ti is best; if you only use dense networks you might want to go with the GTX 970.

Reply

Yu Wang says
2015-04-29 at 22:38

Hi Tim,
Thanks for the insightful posts. I’m a grad student working in the image processing
area. I just started to explore some deep learning techniques with my own data. My
dataset contains 10 thousand 800*600 images with 50+ classes. I'm wondering whether a GTX 970 will be sufficient to try different networks and algorithms, including CNNs.

Reply

Tim Dettmers says


2015-05-01 at 04:45

Although your data set is very small (you will only be able to train a small convolutional net before you overfit), the size of the images is huge.

Unfortunately, the size of the images is the most significant memory factor in convolutional nets. I think a GTX 970 will not be sufficient for this.

However, keep in mind that you can always shrink the images to keep them manageable; for a GTX 970 you will need to shrink them to about 250*190 or so.
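As a rough sketch of why the image size dominates (the 64 feature maps and float32 storage below are assumptions for illustration, not numbers from the comment):

# Hypothetical estimate of first-layer activation memory in a conv net.
def first_layer_activations_mb(width, height, feature_maps=64, bytes_per_float=4):
    return width * height * feature_maps * bytes_per_float / 1024**2

for w, h in [(800, 600), (250, 190)]:
    per_image = first_layer_activations_mb(w, h)
    print(f"{w}x{h}: ~{per_image:.0f} MB per image, "
          f"~{per_image * 128 / 1024:.1f} GB for a batch of 128")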
Reply

Yu Wang says
2015-05-01 at 23:39

Thanks for the quick reply. Look forward to your new articles.

Reply

Dimiter says
2015-04-28 at 08:40

Tim,

Thanks for a great write-up. Not sure what I’d have done without it.
A bit of a n00b question here,
Do you think it matters in practice if one has PCIe 2.0 or 3.0?
Thanks

Reply

Tim Dettmers says


2015-04-28 at 09:42

If it is possible that you will have a second GPU at any time in the future
definitely get a PCIe 3.0 CPU and motherboard. If you use additional GPUs for
parallelism, then in the case of PCIe 2.0 you will suffer a performance loss of
about 15% for a second GPU, and much larger losses (+40%) for your third and
fourth GPU. If you are sure that you will stay with one GPU in the future, then
PCIe 2.0 will only give you a small or no performance decrease (0-5%) and you
should be fine.


Reply

Mark says
2015-04-28 at 16:09

This may not make much difference if you care about a new system now or
about having a more current system in the future. However, if you want to keep
it around for years and use it for other things besides ML then wait a few
months.

Intel's Skylake CPU will be released in a few months along with its new chipset, new socket, new motherboards, etc. All PCIe 3.0, DDR4, etc. It's considered a big change compared to prior CPUs. Skylake prices are supposed to be similar to current offerings, but retailers say they expect the price of DDR4 to drop. Don't
really understand why but gamers are also waiting for the release … maybe just
because “new and improved” since it doesn’t seem to translate into a big plus
for the gaming experience.

Reply

Bjarke Felbo says


2015-04-25 at 21:31

Thanks for a great guide! I’m wondering if you could give me a rough estimate of
the performance boost I would get by upgrading my system? Would be awesome
to have that before I spend my hard-earned money! I suppose it's mainly based on my current GPU, but here's a bit of info about the rest of the system as well.

Current setup:
ATI Radeon™ HD 5770 1gb
One of the last CPU’s from the 775-socket series.
4gb ram
SSD

Upgraded setup:
GTX 960 4gb
Modern dual-thread CPU with 2+ GHz
8gb ram
SSD


Two more questions:


1) I’ve sometimes experienced issues between different motherboard brands and
certain GPUs. Do you have a recommendation for a specific motherboard brand (or
specific product) that would work well with a GTX 960?

2) Any idea of what the performance reduction would be by doing deep learning in
caffe using a Virtualbox environment of Ubuntu instead of doing a plain Ubuntu
installation?
Reply

Tim Dettmers says


2015-04-26 at 08:12

It is difficult to estimate the performance boost if your previous GPU is an ATI GPU; but from the other hardware pieces you should see about a 5-10% increase in performance.

1. I never had any problems with my motherboards, so I cannot give you any advice on that topic.
2. I also had this idea once, but it is usually impossible to do this: CUDA and virtualized GPUs do not go together; you will need specialized GPUs (GRID GPUs, which are used on AWS), and even if they did go together there would be a stark performance decrease.

It is a big change to go from Windows to Ubuntu, but it is really worth doing if you are serious about deep learning. A few months in Ubuntu and you will never want to go back!

Reply

Bjarke Felbo says


2015-04-26 at 20:19

Thanks for the quick response! I’ll try Ubuntu then (perhaps some dual-
booting). Would it make sense to add water-cooling to a single GTX 960 or
would that be overkill?

Reply


Shinji says
2015-04-10 at 07:31

Hi Tim, this is a great post!

I’m interested in the actual PCIe bandwidth in the deep learning process. Are PCIe
16 lanes needed for deep learning? Of course x16 PCIe gen3 is ideal for the best
performance, but I'm wondering if x8 or x4 PCIe gen3 also offers enough performance.

Which do you think is the better solution if the system has 64 PCIe lanes?

* 4 GPGPUs connected with 16 PCIe lanes each


* 16 GPGPUs connected with 4 PCIe lanes each

Which is the more important factor: the number of GPGPUs (computational power) or PCIe bandwidth?

Reply

Tim Dettmers says


2015-04-10 at 08:50

Each PCIe lane for PCIe 3.0 has a theoretical bandwidth of about 1 GB/s, so you
can run GPUs also with 8 lanes or 4 lanes (8 lanes is standard for at least one
GPU if you have more than 2 GPUs), but it will be slower. How much slower will
depend on the application or network architecture and which kind of
parallelism is used.

64 PCIe lanes are only supported by dual-CPU motherboards, and these boards often have a special PCIe switching architecture which connects the two separate PCIe systems (one for each CPU) with each other; I think you can only run up to 8 GPUs with such a system (the BIOS often cannot handle more GPUs even if you have more PCIe slots). But if you take this as a theoretical example it is best to just do some test calculations:
16 GPUs means 15 data transfers to synchronize information; 4 PCIe lanes / 15 transfers = 0.2666 GB/s for a full synchronization. If you now have a weight matrix with, say, 800×1200 floating point numbers, you have 800×1200×4 bytes×1024^-3 = 0.0036 GB. This means you could synchronize 0.2666/0.0036 = 74 gradients per second. A good implementation of MNIST with batch size 128 will run at about 350 batches per second. So the result is that 16 GPUs with 4 PCIe lanes will be about 5 times slower for MNIST. These numbers are better for convolutional nets, but not much better. Same for 4 GPUs/16 lanes:
16/3 = 5.33; 5.33/0.0036 = 647; so in this case there would be a speedup of about 2 times; this is better for convolutional nets (you can expect a speedup of 3.0-3.9 depending on the implementation). You can do similar calculations
for model parallelism in which the 16 GPU case would fare a bit better (but it is
probably still slower than 1 GPU).

So the bottom line is that 16 GPUs with 4 PCIe lanes are quite useless for any sort of parallelism — PCIe transfer rates are very important for multiple GPUs.
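If it helps, here is a minimal Python sketch of the same back-of-the-envelope reasoning (about 1 GB/s per PCIe 3.0 lane and one full synchronization per batch are simplifying assumptions):

# Gradients that can be synchronized per second, following the estimate above.
def gradients_per_second(n_gpus, lanes, n_weights):
    sync_bandwidth = lanes * 1.0 / (n_gpus - 1)   # GB/s left for a full sync
    gradient_gb = n_weights * 4 / 1024**3          # 4 bytes per float32 weight
    return sync_bandwidth / gradient_gb

# 16 GPUs with 4 lanes each and an 800x1200 weight matrix: ~74 syncs/s,
# far below the ~350 batches/s a good MNIST implementation reaches.
print(int(gradients_per_second(16, 4, 800 * 1200)))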
Reply

Shinji says
2015-04-10 at 10:07

Thank you for the explanation.


Regarding your description: it depends on the application, but the data transfer time among GPUs is dominant in a multi-GPU environment.

However, I have another question.

In your assumption, the GPU processing time is always shorter than the data transfer time. In the 16-GPU case, GPU processing must take less than 14 msec to process one batch. In the 4-GPU case, it must take less than about 2 msec.

If the GPU processing time is much longer than the data transfer time, the data transfer time for synchronization is negligible. In that case, it is more important to have many GPUs than PCIe bandwidth.
Is my assumption unlikely in the usual case?

Reply

Tim Dettmers says


2015-04-10 at 11:02

This is exactly the case for convolutional nets, where you have high computation with small gradients (weight sharing). However, even for convolutional nets there are limits to this; beyond eight GPUs it can quickly become difficult to gain near-linear speedups, which is mostly due to the slow interconnects between computers. An 8-GPU system will be reasonably fast, with speedups of about 7-8 times for convolutional nets, but for more than 8 GPUs you have to use network interconnects like InfiniBand. InfiniBand is similar to PCIe, but its speed is fixed at about 8-25 GB/s (8 GB/s is the affordable standard; 16 GB/s is expensive; 25 GB/s is very, very expensive). So for 6 GPUs + an 8 GB/s standard connection this yields a bandwidth of 1.6 GB/s, which is much worse than the 4 GPU / 16 lanes example; for 12 GPUs this is 0.72 GB/s; 24 GPUs 0.35 GB/s; 48 GPUs 0.17 GB/s. So pretty quickly it will be pretty slow even for convolutional nets.
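A tiny sketch of the same scaling argument, assuming the affordable 8 GB/s link and one full synchronization across all GPUs (values match the ones quoted above up to rounding):

# Bandwidth left per full synchronization as the number of GPUs grows.
link_gb_per_s = 8.0
for n_gpus in (6, 12, 24, 48):
    print(n_gpus, "GPUs:", round(link_gb_per_s / (n_gpus - 1), 2), "GB/s")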


Reply

Tim Dettmers says


2015-05-06 at 05:57

I overlooked your comment, but it is actually a very good question. It turns out that you hit the mark exactly: the less communication is needed, the more it pays to add GPUs rather than bandwidth. However, in deep learning there are only a few cases where it makes sense to trade bandwidth for more GPUs. Very deep recurrent neural networks (time dimension) would be an example, and to some degree (very) deep neural networks (20+ layers) are of this type. However, even for 20+ layers you still want to look at maximizing your bandwidth to maximize your overall performance.

For dense neural networks, anything above 4 GPUs is rather impractical. You can make it work to run faster, but this requires much effort and several compromises in model accuracy.

Reply

Mark says
2015-04-08 at 20:41

Got a bit of a compromise I am thinking about to save cash when picking a CPU.

The i7-5820K and i7-5930K are the same except for PCIe lanes (28 versus 40). According to this video:

https://youtu.be/rctaLgK5stA

It comes down to using, say, a 4th 980 or Titan; otherwise, if it's three or fewer, there is no real performance difference. This means a saving on the CPU of about $200.

What are your thoughts, since you warned about the i7-5820 in your article?

Reply


Tim Dettmers says


2015-04-09 at 19:56

Yes, the i7-5820K only has 28 PCIe lanes, and if you buy more than one GPU I would definitely choose a different CPU. The penalty will be observable when you use multiple GPUs, especially if you use 4x GTX 980 (personally, I would choose a cheap CPU < $250 with 40 lanes and instead buy 4x GTX Titan X — that will be sufficient). One note though: remember that in 2016 Q3/Q4 there will be Pascal GPUs, which are about 10 times better than a GTX Titan X (which is 50% better than a GTX 980), so it might be reasonable to go with a cheaper system and go all out once Pascal GPUs are released.

Reply

Mark says
2015-04-09 at 20:29

Well, if I buy the CPU and motherboard now, then I would like to upgrade this system to Pascal in a couple of years. To keep this base system current over a few years, would you still recommend an X99 motherboard? If so, then I am stuck with only two choices: 5930 or 5960.

AMD has CPUs and associated motherboards, but I am not familiar with anything in that direction. Do they have something here that is cheaper, has about the same performance, and can handle up to 4 980/Titan/Pascal GPUs?

BTW, I thought I read somewhere that no current motherboard will handle Pascal; is that correct?

Reply

Tim Dettmers says


2015-04-09 at 20:40

An X99 motherboard might be a bit of an overkill. You will not need most of its features, like DDR4 RAM. As you said, the Pascal GPUs will use their own interconnect which is much faster than PCIe — this would be another reason to spend less money on the current system. A system based on either the LGA1150 or the LGA2011 would be a good choice in terms of performance/cost.

I do not have experience with AMD either, but from the calculations in my blog post I am quite certain that it would also be a reasonable choice. I think in the end it just comes down to how much money you have to spare.
Reply

Mark says
2015-04-09 at 21:52

Great, thanks! Still, one thing remains unclear to a newbie builder like me. Is the X99 chipset tied only to motherboards which will not work with Volta/Pascal? If not, then I can just swap out the motherboard but keep the X99-compatible CPU, memory, etc.

Also, since you are writing about convolutional nets: these are front-ends that feed neural nets. However, there is a new paper on using an SVM approach that needs less memory, is faster, and is just as accurate as any state-of-the-art convnet/neural-net combo. It keeps the convolution and pooling layers but replaces the neural net with a new fast-food (LOL) version of an SVM. They claim it works "better".

"Deep Fried Convnets" by Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, Ziyu Wang.

The SVM versus neural-net battle continues.

Peyman says
2015-03-31 at 06:14

Great guide Tim, thanks.


I am wondering if you get the display output from the same GPUs which you do the computation on?
I'm going to buy a 40-lane i7 CPU, which is an LGA 2011 socket, along with a GTX 980. It seems that none of the CPUs with this socket have an integrated GPU to drive a display. And the other CPUs, LGA 1150 and LGA 1155, do not support more than 28 lanes.

So, the question is: do I need a separate GPU to drive displays, or can I do the compute and run the displays on the same GPU?
Reply

Tim Dettmers says


2015-03-31 at 06:34

You can use the same GPU for computation and for display; there will be no problem. The only disadvantage is that you have a bit less memory. I use 3x 27-inch monitors at 1920×1080 and this config uses about 300-400 MB of memory, which I hardly notice (well, I have 6GB of GPU memory). If you are worried about that memory you can get a cheap NVIDIA GT 210 (which can drive 2 monitors) for $30 and run your displays on that, so that your GTX 980 is completely free for CUDA applications.
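If you want to check how much memory your monitors actually occupy on a given card, a minimal sketch like this works (it assumes nvidia-smi is installed and on your PATH; the sample output in the comment is made up):

# Query used and total memory for every NVIDIA GPU in the system.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for i, line in enumerate(out.stdout.strip().splitlines()):
    print(f"GPU {i}: {line}")  # e.g. "GPU 0: 412 MiB, 6144 MiB"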

Reply

Harry says
2017-01-11 at 03:05

I realize this is an old post, but what motherboard did you pick? Most LGA 2011 boards seem not to support dual x16, which I thought was the attraction of the 40 PCIe lanes.

Reply

Lucas Shen (@icrtiou) says


2015-03-30 at 18:13

Hi Tim,
I'm interested in the GPU BIOS. Can you share which BIOS (the one that includes a new, more reasonable fan schedule) you are using right now? I have 2 Titan Xs waiting to be flashed.

Reply

Tim Dettmers says



2015-03-30 at 18:43
I do not know if a GTX 970/GTX 980 BIOS is compatible with the GTX Titan X. Doing a quick Google search, I cannot find information about a GTX Titan X BIOS, which might be because the card is relatively new.

I think you will find the best information in folding@home and other crowd-computing forums (also cryptocurrency mining forums) to get this working.

Reply

Lucas Shen (@icrtiou) says


2015-03-30 at 19:24

Thanks for the pointers. Folding@home is very interesting XD, though I haven't found a Titan X BIOS yet. Guess I have to live with it for a while.

I saw you have plans to release a deep learning library in the future. Which framework will you be working on? Torch7, Theano, Caffe?

Reply

salem ameen says


2015-03-23 at 21:28

Hi Tim,
I bought an MSI G80 laptop to learn and work on deep learning; it connects 2 GPUs using SLI. Could you please tell me if I can run deep learning on this laptop, even on just one GPU?
Regards,

Reply

Tim Dettmers says


2015-03-24 at 06:12

Yes, you will be able to use a single GPU for deep learning; SLI has nothing to do with CUDA. Even with dual-GPU cards (like the GTX 590), at the hardware level you can simply access both GPUs separately. This is also true for software libraries like Theano and Torch.
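For completeness, here is a minimal sketch of addressing two GPUs separately from Python. It uses PyTorch, which is not what the thread discusses (Torch7 and Theano have equivalent per-device selection), so treat it as an illustration only; it assumes at least two CUDA devices are present.

# Run two independent models, one per GPU.
import torch

print("visible GPUs:", torch.cuda.device_count())

model_a = torch.nn.Linear(1024, 10).to("cuda:0")  # model on the first GPU
model_b = torch.nn.Linear(1024, 10).to("cuda:1")  # model on the second GPU

x = torch.randn(32, 1024)
print(model_a(x.to("cuda:0")).shape, model_b(x.to("cuda:1")).shape)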
Reply

salemameen says
2015-03-24 at 08:01

Thanks Tim,
Because I don't have a background in coding, I want to use existing libraries. By the way, I bought this laptop not for gaming but for deep learning; I thought it would be more powerful with 2 GPUs, but even if only one works that is OK for me. Regards,

Reply

Tim Dettmers says


2015-03-24 at 17:18

You're welcome! If you use Torch7 you will be able to use both GPUs quite easily. If you dread working with Lua (it is quite easy actually; most of the code will be in Torch7, not plain Lua), I am also working on my own deep learning library which will be optimized for multiple GPUs, but it will take a few more weeks until it reaches a state which is usable for the public.

Reply

Mark says
2015-03-24 at 17:54

Looking at two possible X99 boards: ASUS X99-Deluxe (~$410 US) and ASUS Rampage V Extreme (~$450 US). Unless you know something, I do not see that the extra $40 will make any difference for ML, but maybe it does for other stuff like multimedia or gaming.

Will start with 16GB or 32GB DDR4 (haven't decided yet, ~$500-$700 US).

I plan to use the 6-core i7-5930K (~$570 US). By your recommendation of 2 cores per GPU that means max 3 GPUs.

GTX 980s are ~$500 US and GTX Titans ~$1000 US. Besides the loss of PCIe slots and extra liquid cooling, what speed difference does one expect in a system with two GTX 980s versus an identical system with one GTX Titan?

Tim Dettmers says


2015-03-31 at 06:42

I do not think the boards make a great difference; what matters is the chipset (X99) rather than the specific board.

I think 6 cores should also be fine for 4 GPUs. On average, the second core is only used sparsely, so 3 threads can often feed 2 GPUs just fine.

One GTX Titan X will be about 150% as fast as a single GTX 980, so two GTX 980s are faster in total, but because one GPU is much better and easier to use than two, I would go for the GTX Titan X if you can afford it.

Mark says
2015-03-31 at 13:20

“One GTX Titan X will be 150% as fast as a single GTX 980, so two
GTX 980 are faster, but because one GPU is much better and
easier to use than two, I would go for the GTX Titan X if you can
afford it.”

Thanks for the advice. Could you elaborate a bit more on the ease of use of one GPU versus two?

Also, I understand the Titan will be replaced this year with a faster GTX 980 Ti. They will be the same price.

Tim Dettmers says


2015-03-31 at 13:46

If you use Torch7 then it will be quite straightforward to use 2 GPUs on one problem (2 GPUs yield about 160% of the speed of a single GPU); other libraries do not support multiple GPUs well (Theano/Pylearn2, Caffe), and others are quite complicated to use (cuda-convnet2). So 160% is not much faster than a GTX Titan X, and if you also want to use different libraries, a GTX Titan X would be faster overall (and it has more memory too!).

I am currently working on a library that combines the ease of use of Torch7 with very efficient parallelism (+190% speedup for 2 GPUs), but it will take a month or two until I have implemented all the needed features.
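The arithmetic behind that recommendation is simple enough to spell out (all factors are the rough estimates from this thread, relative to a single GTX 980):

# Rough relative throughput, single GTX 980 = 1.0 (estimates from the comment).
options = {
    "GTX Titan X": 1.5,                        # ~150% of a single GTX 980
    "2x GTX 980, Torch7 parallelism": 1.6,     # ~160%
    "2x GTX 980, efficient parallelism": 1.9,  # ~190%
}
for name, speed in options.items():
    print(f"{name}: {speed:.1f}x")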

benoit says
2015-03-18 at 15:25

Motherboard: Get PCIe 3.0 and as many slots as you need for your (future) GPUs
(one GPU takes two slots; max 4 GPUs per system)

Just to be sure I get it: all GPUs are better on a PCIe 3.0 slot, and as each GPU seems to take 2 slots (due to its size), for 3 GPUs you'd need a motherboard with 6 PCIe 3.0 slots?

Reply

Tim Dettmers says


2015-03-18 at 15:31

That’s right, modern GPUs will run faster on a PCIe 3.0 slot.

To install a card you only need a single PCIe 3.0 slot, but because each card has a width of two PCIe slots, it will render the slot next to it unusable. For 3 GPUs you will need 5 physical PCIe slots, because the first two cards cover 4 slots and you need a single fifth slot for the last GPU.

So a motherboard with 5x PCIe 3.0 x16 is fine for 3 GPUs.
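A small sketch of that slot-counting rule (each dual-width card uses its own x16 slot and blocks the neighbouring one, except that the last card only needs a free edge):

# Physical slots needed for n dual-width GPUs under the rule described above.
def physical_slots_needed(n_gpus):
    return 0 if n_gpus == 0 else 2 * n_gpus - 1

for n in range(1, 5):
    print(n, "GPU(s) ->", physical_slots_needed(n), "slots")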

Reply

benoit says
2015-03-18 at 15:39

When I was mining bitcoins (unfortunately with Radeon cards, hence why I'm so interested in your article) I used PCIe risers like
(http://www.amazon.fr/gp/product/B001CC3BNS?psc=1&redirect=true&ref_=oh_aui_search_detailpage)
Do you think those can act as a bottleneck between the PCIe 3.0 slot of the motherboard and the GPU? Using those could prove useful for finding a cheaper motherboard with fewer PCIe 3.0 slots.

Reply

Tim Dettmers says


2015-03-18 at 16:04

I also read a bit about risers when I was building my GPU cluster, and I often read that there was little to no degradation in performance. However, I do not know what PCIe lane configuration the motherboard will run under such a setup (e.g. 16x/8x/8x/8x or 16x/16x/8x are standard for 4 and 3 GPUs, respectively), and this might be a problem (the motherboard might not support it well). For cryptocurrency mining this is usually not a problem, because you do not have to transfer as much data over the PCIe interface compared to deep learning — so probably no one has ever tested this under deep learning conditions.

So I am not really sure how it will work, but it might be worth a try to test this on one of your old mining motherboards and then buy a motherboard accordingly. If you decide to do so, then please let me know. I would be really interested in what is going on in that case and how well it works. Thanks!

Reply

Damien MENIGAUX says


2019-05-10 at 07:06

I have tried and built many systems with passive risers. Always use similar ones (x16 to x16), but this one seems a bit cheap.


I never could make it work with Molex-powered risers though. I would get packet loss and it would make the training fail.

Tim Dettmers says


2019-06-13 at 19:29

I am currently using two x16-to-x16 risers and it works like a charm to prevent overheating of 4 RTX 2080 Tis in one case. Some other students at the University of Washington use the same setup with great success.

Stijn says
2015-03-15 at 07:44

What is the largest dataset you can analyze (you can choose the specs you want), and how much time would it take?

Reply

Tim Dettmers says


2015-03-16 at 18:59

The sky is the limit here. Google ran conv nets that took months to complete
and which were run on thousands of computers. For practical data sets,
ImageNet is one of the larger data sets and you can expect that new data sets
will grow exponentially from there. These data sets will grow as your GPUs get
faster, so you can always expect that the state of the art on a large popular
data set will take about 2 weeks to train.

Reply


Mark says
2015-03-14 at 18:26

What motherboards, by company and model number, do you recommend (ASUS, MSI, etc.) for a home PC that will be used for multimedia as well (not concerned with gaming)? I am thinking of using a single GTX 980 but may think about adding more GPUs later (not a crucial concern). Also, which i7 CPU models do I need? Thanks for the help and for the suggestion of the 960 as an alternative to the 580. I am learning Torch7 and can afford the 980.

Reply

Tim Dettmers says


2015-03-16 at 18:57

I only have experience with the motherboards that I use, and one of them has a minor hardware defect, so I do not think my experience is representative of the overall product; this is similar for other hardware pieces. I think with the directions I gave in this guide you can find your parts on your own through lists that feature user ratings, like http://pcpartpicker.com/parts/

Often it is quite practical to sort by rating and buy the first highly rated hardware piece which falls within your budget.

Reply

Felix Lau says


2015-03-14 at 10:51

Thanks for this great post!

What are your thoughts on using a g2.xlarge instead of building the hardware? I believe the g2.xlarge is a lot slower than a GTX 980. However, it is possible to spawn many instances on AWS at the same time, which might be useful for tuning hyperparameters.

Reply


Tim Dettmers says


2015-03-16 at 18:52

Indeed the g2.xlarge is much slower than the GTX 980, but also much cheaper. It is a cheap option if you want to train multiple independent neural nets, but it can be very messy. I only have experience with regular CPU instances, but with those it can take considerable time to manage one's instances, especially if you are using AWS for large data sets together with spot instances — you will definitely be more productive with a local system. But in terms of affordability GPU instances are just the best.

I just want to make you aware of other downsides of GPU instances, but the overall conclusion stays the same (less productivity, but very cheap): You cannot use multiple GPUs on AWS instances because the interconnect is just too slow and will be a major bottleneck (4 GPUs will run slower than 2). Also, the PCIe interconnect performance is crippled by the virtualization. This can be partly improved by a hacky patch, but overall the performance will still be bad (it might well be that 2 GPUs are worse than 1 GPU).
Also, like the GTX 580, the GPU instances do not support newer software, and this can be quite bad if you want to run modern variants of convolutional nets.

Reply

Mark Trovinger says


2015-03-13 at 17:55

What IDE are you using in that pic? It looks like Eclipse but I can’t quite tell. Great
article, a full breakdown is just what I needed!

Reply

Tim Dettmers says


2015-03-13 at 18:04

Glad that you liked the article. I am using Eclipse (NVIDIA Nsight) for
C++/C/CUDA in that pic; I also use Eclipse for Python (PyDev) and Lua
(Koneki). While I am very satisfied with Eclipse for Python and CUDA, I am less
satisfied with Eclipse for Lua (that is, for Torch7) and I will probably switch to Vim for that.


Reply

sshidy2014 says
2015-03-12 at 23:03

About (possibly) multiple GPUs: would NVIDIA SLI be of any significant help?

Reply

Tim Dettmers says


2015-03-13 at 06:50

Thanks for your comment. NVIDIA SLI is an interface which allows computer graphics frames to be rendered on each GPU and exchanged via SLI. The use of SLI is limited to this application, so doing computations and parallelizing them via SLI is not possible (one needs to use the PCIe interface for this). So CUDA cannot use SLI.

Reply

Sancho McCann says


2015-03-11 at 23:43

Thoughts on the Tesla K40? It’s one of the GPUs available through NVIDIA’s
academic hardware grant
program: https://developer.nvidia.com/academic_hw_seeding

Reply

Tim Dettmers says


2015-03-12 at 06:28

A K40 will be similar to a GTX Titan in terms of performance. The additional memory will be great if you train large conv nets, and this is the main advantage of the K40. If you can choose the upcoming GTX Titan X in the academic grant program, that might be the better choice, as it is much faster and will have the same amount of memory.
Reply

dh says
2015-03-20 at 21:02

Why is the K40 much more expensive when the GTX Titan X is cheaper but has more cores and higher bandwidth?

Reply

Tim Dettmers says


2015-03-20 at 21:07

The K40 is a compute card which is used for scientific applications (often systems of partial differential equations) which require high precision. Tesla cards have additional double precision and memory correction modules which make them excel at high-precision tasks; these extra features, which are not needed in deep learning, make them so expensive.

Reply

zeecrux says
2015-07-03 at 10:00

ImageNet on K40:
Training is 19.2 secs / 20 iterations (5,120 images) – with cuDNN

and GTX770:
cuDNN Training: 24.3 secs / 20 iterations (5,120 images)
(source: http://caffe.berkeleyvision.org/performance_hardware.html)

I trained the ImageNet model on a GTX 960 and got this result:


Training is around 26 secs / 20 iterations (5,120 images) – with cuDNN
So GTX 960 is close to GTX 770

So for 450000 iterations, it takes 120 hours (5 days) on K40, and 162.5
hours (6.77 days) on GTX 960.

Now K40 costs > 3K USD, and GTX 960 costs < 300 USD
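The totals follow from simple arithmetic, sketched below (the 26 s / 20 iterations figure for the GTX 960 is the rounded number from above):

# Total training time for 450,000 iterations from a secs-per-20-iterations figure.
def total_hours(secs_per_20_iters, total_iters=450_000):
    return total_iters / 20 * secs_per_20_iters / 3600

for name, secs in [("K40", 19.2), ("GTX 770", 24.3), ("GTX 960", 26.0)]:
    print(f"{name}: ~{total_hours(secs):.0f} hours")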
Reply

Tim Dettmers says


2015-07-03 at 11:05

Thanks, this is very useful information!

Reply

Hannes says
2015-03-11 at 03:45

I find the recommendation of the GTX 580 for *any* kind of deep learning or
budget a little dubious since it doesn’t support cuDNN. What good is a GPU that
doesn’t support what’s arguably the most important library for deep learning at the
moment?

Reply

Tim Dettmers says


2015-03-11 at 09:14

This is a really good and important point. Let me explain my reasoning why I
think a GTX 580 is still good.

The problem with no cuDNN support is really that you will require much more
time to set everything up and often cutting-edge features that are
implemented in libraries like torch7 will not be available. But it is not impossible
to do deep learning on a GTX 580 and good, usable deep learning software
exists. One will probably need to learn CUDA programming to add new
features through one’s own CUDA kernels, but this will just require time and not
money. For some people time and effort is relatively cheap, while money is
rather expensive. If you think about students in developing countries this is very
much true; if you earn $5500 a year (average GDP per capita ppp of India; for
the US this is $53k – so think about your GPU choice if you had 10 times less
money) then you will be happy that there is a deep learning option that costs

https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/ 360/366
2024/4/2 12:40 A Full Hardware Guide to Deep Learning — Tim Dettmers

less than $120. Of course I could recommend cards, like the GTX 750, which are
also in that price range and which work with cuDNN, but I think a GTX 580
(must faster and more memory) is just better than a GTX 750 (cuDNN support)
or other alternatives.

EDIT: I think it might be good to add another option, which offers support for
cuDNN but which is rather cheap, like the GTX 960 4GB (only a bit slower than
the GTX 580) which will be available shortly for about $250-300. But as you
see, an additional $130-180 can be very painful if you are a student in a
developing country.
Reply

DarkIdeals says
2016-09-08 at 08:33

A great 2016 update, if you happen to still frequent this blog (I don't see any recent posts), is the new GTX 1060 Pascal graphics card, specifically the 3GB model. Now, 3GB is definitely cutting a tad close on memory, however it's a VASTLY superior choice to both a 580 AND a 960 4GB. The 1060 6GB model is equivalent to a GTX 980 in overall performance, and the 3GB 1060 model is only ever-so-slightly weaker, putting it at the level of a hugely overclocked GTX 970 (I'm talking ~1,650MHz 970 levels, which is maybe ~5% below a 980).

And the 3GB 1060 can be had for a measly $199 BRAND NEW! It's definitely something to consider at least. And if you still desperately need that extra VRAM, then the 6GB version of the 1060 (which, as I mentioned, is literally about tied with an average GTX 980!) can be had for as little as $249 right now!

Reply

Tim Dettmers says


2016-09-10 at 03:54

I updated my GPU recommendation post with the GTX 1060, but I did not mention the 3GB version, which did not exist at that time. Thanks for letting me know!

Reply

Khalid says
2017-04-13 at 09:02

Hi,
I want to get a system with a GPU for speech processing and deep learning applications using Python.

Can you please send me reasonable system hardware requirements?

Tim Dettmers says


2017-04-13 at 14:54

For these applications a “standard” deep learning system will be


sufficient. You can find examples of such systems in the comments
section (search for “pcpartpicker” and you will probably find some
examples).

Rusty Scupper says


2015-03-10 at 02:01

what,the.heck… Could you have skipped the blather and gotten to the point? There
are only a few specific combinations that support what you were trying to explain
so maybe something like:
– GTX 580/980
– i5 / i7 CPU
– Lots of ram (duh)
– Fast hard drive

Reply

Tim Dettmers says


2015-03-10 at 08:38

Give a man a fish and you feed him for a day; teach a man to fish and you feed
him for a lifetime.


Reply

zeng says
2016-08-29 at 05:17

授人以鱼不如授人以渔 (the same proverb exists in Chinese).

Very helpful, thanks for sharing.

Reply

lU says
2015-03-09 at 22:59

Covers everything I wanted to know and even more, thanks!

It also confirms my choice of a Pentium G3258 for a single-GPU config. Insanely cheap, and it even has ECC memory support, something that some folks might want to have.

Reply

cicero19 says
2015-03-09 at 20:36

Hi Tim,

This is a great overview. Wondering if you could recommend any cost-effective


CPUs with 40 PCIe lanes.

Thanks!

Reply

Tim Dettmers says


2015-03-10 at 08:41
There are many CPUs in all different price ranges which are all reasonable choices, and most CPUs support 40 PCIe lanes. The best practice is probably to look at a site like http://pcpartpicker.com/parts/cpu/ and select a CPU with a good rating and a good price; then check that it supports 40 lanes and you will be good to go.

Reply

Alexander Rezanov says


2019-06-20 at 10:35

Good afternoon. Can you please help me? There is a used computer on offer in my neighbourhood for about $800:
i7-4790K
MSI 1080
DDR3 4GB x2 + DDR3 8GB x2
WD 1TB SSD
Is it a good choice for starting out in deep learning?

Reply

Tim Dettmers says


2019-06-29 at 08:23

Yes, it would be okay for a beginner. You can run most models and
explore deep learning problems. You will not be able to run some of
the largest deep learning models, but that should not be your goal
when you learn and explore deep learning anyway.

Reply

Alexander Rezanov says


2019-10-05 at 13:06


Excuse me for the off-topic question, but are you familiar with TensorFlow?
