I recently installed nvidia-docker on our lab's workstation, since many people use it and they need different environments. However, I find it has become much slower than before; not only TensorFlow but MXNet is affected too. Do I need to change some config, and how? Thanks for any help!
PS: the image I pull is tensorflow/tensorflow:0.10.0rc0-gpu
This is probably because there is something in your environment that is different, maybe you compiled the project differently. We didn't notice any performance difference with Docker for all the deep learning frameworks we tested.
Also, you need to be careful regarding I/O: during training you shouldn't read or write to the container filesystem, as it will be slower. All reads/writes should go to/from volumes mounted from the host. For instance, this is what we recommend for DIGITS:
nvidia-docker run --name digits -d -p 5000:34448 -v /opt/mnist:/data/mnist -v digits-jobs:/jobs nvidia/digits
It also means your training jobs will not be deleted when the container is removed, which is probably what you want.
@flx42 thanks a lot!
this is what I used for tensorflow:
sudo nvidia-docker run -it -p 8883:8888 -v /home/common/docker_data/tensorflow:/home/data tensorflow/tensorflow:0.10.0rc0-gpu jupyter-notebook --no-browser --ip=0.0.0.0 --notebook-dir='/'
I read and write in /home/data. I think maybe it's slower because of compile settings; I will try to find out.
@nightinwhite did you find the reason for your slowdown?
I find that perf top reports as follows during slow-downs (persisting for about a minute):

It may be having some problems compiling CUDA code from TensorFlow...?
Would there be any way to inspect detailed logs of what's happening there?
Oh, I have figured out what's happening.
On the first run of TensorFlow in a fresh container, it JIT-compiles its CUDA kernels and caches them.
If I run the same test case (which uses several TensorFlow modules) twice without re-creating the container, the second run takes just 2-3 seconds while the first run takes more than 90 seconds. Now we need to know how to reliably build these caches upon container deployment (to machines that may have different GPU models)...
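One approach (a hypothetical sketch; the warm-up command and the use of an entrypoint script are assumptions, not an official recommendation) is to run a trivial TensorFlow session once in the image's entrypoint, so the CUDA JIT kernels are compiled and cached for that machine's GPU before any real job starts:

```shell
#!/bin/sh
# Hypothetical entrypoint: warm up the CUDA JIT cache once per deployment.
# Running any small op forces TensorFlow to compile kernels for this GPU;
# combined with a persistent CUDA_CACHE_PATH, later runs skip the JIT step.
python -c "import tensorflow as tf; print(tf.Session().run(tf.constant(1.0)))"

# Then hand control to the real command (e.g. jupyter-notebook).
exec "$@"
```

This only helps if the cache survives container re-creation, e.g. via a mounted volume as discussed below in the thread.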
Still the remaining question is: why doesn't this slow-down happen with nvidia-docker command?
Make sure Tensorflow is compiled for your CUDA architecture (with TF_CUDA_COMPUTE_CAPABILITIES).
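As a rough sketch of what that looks like (the compute capability value 6.1 is just an example for a Pascal-class consumer GPU; check yours with nvidia-smi or deviceQuery, and assume a TensorFlow source checkout):

```shell
# Build TensorFlow for a specific CUDA compute capability so no JIT
# recompilation is needed at first run. 6.1 is an example (e.g. GTX 1080).
export TF_CUDA_COMPUTE_CAPABILITIES=6.1
./configure   # answer the CUDA-related prompts
bazel build --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
```

Multiple capabilities can be listed comma-separated (e.g. `3.5,5.2,6.1`) if the image must serve machines with different GPU models.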
As for the JIT, see CUDA_CACHE_PATH if you want to configure the cache location.
I'm using the version from pip:
# Ubuntu/Linux 64-bit, GPU enabled, Python 3.5
# Requires CUDA toolkit 8.0 and CuDNN v5. For other versions, see "Install from sources" below.
$ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.11.0rc2-cp35-cp35m-linux_x86_64.whl
It seems to be JIT-ing on the first run.
Now, my questions are:
You can't do that easily; this is why I suggest building TensorFlow yourself to target your specific compute architecture (see the official Dockerfile).
The JIT compilation is done lazily; if you rely on it, the best you can do is use CUDA_CACHE_PATH with a Docker volume to speed up further container launches.
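For example (a sketch; the volume name, mount path, and script name are arbitrary placeholders), the JIT cache can be persisted across container launches with a named volume:

```shell
# Create a named volume once; it survives container removal.
docker volume create cuda-cache

# Point the CUDA JIT cache at the volume. The first run pays the JIT cost;
# subsequent containers reusing the volume start without the delay.
nvidia-docker run -it \
  -e CUDA_CACHE_PATH=/cuda-cache \
  -v cuda-cache:/cuda-cache \
  tensorflow/tensorflow:0.10.0rc0-gpu python my_script.py
```

Note that the CUDA driver also caps the JIT cache size (see CUDA_CACHE_MAXSIZE), so a large model may need that raised as well.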
@achimnol: also, if you rely on TensorFlow operations that use custom CUDA code (instead of cuDNN), compiling directly for your GPU architecture could yield better performance than JIT-ing from a previous architecture (even excluding the JIT time).
@3XX0 Thanks for the answers. I finally fell back to rebuilding TensorFlow myself with customized CUDA compute capability options, and now the first-run delay is gone.
@flx42 Currently I don't have custom operators, but I will keep your note in mind when I do!