I recently installed nvidia-docker on our lab's workstation, since many people use it and they need different environments. However, I find it has become much slower than before; not only TensorFlow but MXNet is affected too. Do I need to change some config, and how? Thanks for any help!
PS: the image I pull is tensorflow/tensorflow:0.10.0rc0-gpu
This is probably because there is something in your environment that is different, maybe you compiled the project differently. We didn't notice any performance difference with Docker for all the deep learning frameworks we tested.
Also, you need to be careful regarding I/O: during training you shouldn't read or write to the container filesystem, as it will be slower. All reads/writes should go to/from volumes mounted from the host. For instance, this is what we recommend for DIGITS:
nvidia-docker run --name digits -d -p 5000:34448 -v /opt/mnist:/data/mnist -v digits-jobs:/jobs nvidia/digits
It also means your training jobs will not be deleted when the container is removed, which is probably what you want.
@flx42 thanks a lot!
this is what I used for tensorflow:
sudo nvidia-docker run -it -p 8883:8888 -v /home/common/docker_data/tensorflow:/home/data tensorflow/tensorflow:0.10.0rc0-gpu jupyter-notebook --no-browser --ip=0.0.0.0 --notebook-dir='/'
I read and write in /home/data. I think maybe it's slower because of compile settings; I will try to find out.
@nightinwhite did you find the reason for your slowdown?
I find that perf top reports as follows during slow-downs (persisting for about a minute):

It may be having some problems compiling CUDA code from TensorFlow...?
Would there be any way to inspect detailed logs of what's happening there?
Oh, I have figured out what's happening.
On the first run of TensorFlow in a fresh container, it JIT-compiles its CUDA kernels and caches them.
If I run the same test case (which uses several TensorFlow modules) twice without re-creating the container, the second run takes just 2-3 seconds while the first run takes more than 90 seconds. Now we need to know how to reliably build these caches upon container deployment (to machines that may have different GPU models)...
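One approach (a hypothetical sketch; the warm-up command and the use of an entrypoint script are assumptions, not an official recommendation) is to run a trivial TensorFlow session once in the image's entrypoint, so the CUDA JIT kernels are compiled and cached for that machine's GPU before any real job starts:

```shell
#!/bin/sh
# Hypothetical entrypoint: warm up the CUDA JIT cache once per deployment.
# Running any small op forces TensorFlow to compile kernels for this GPU;
# combined with a persistent CUDA_CACHE_PATH, later runs skip the JIT step.
python -c "import tensorflow as tf; print(tf.Session().run(tf.constant(1.0)))"

# Then hand control to the real command (e.g. jupyter-notebook).
exec "$@"
```

This only helps if the cache survives container re-creation, e.g. via a mounted volume as discussed below in the thread.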
Still the remaining question is: why doesn't this slow-down happen with nvidia-docker command?
Make sure Tensorflow is compiled for your CUDA architecture (with TF_CUDA_COMPUTE_CAPABILITIES).
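As a rough sketch of what that looks like (the compute capability value 6.1 is just an example for a Pascal-class consumer GPU; check yours with nvidia-smi or deviceQuery, and assume a TensorFlow source checkout):

```shell
# Build TensorFlow for a specific CUDA compute capability so no JIT
# recompilation is needed at first run. 6.1 is an example (e.g. GTX 1080).
export TF_CUDA_COMPUTE_CAPABILITIES=6.1
./configure   # answer the CUDA-related prompts
bazel build --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
```

Multiple capabilities can be listed comma-separated (e.g. `3.5,5.2,6.1`) if the image must serve machines with different GPU models.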
As for the JIT, see CUDA_CACHE_PATH if you want to configure the cache location.
I'm using the version from pip:
# Ubuntu/Linux 64-bit, GPU enabled, Python 3.5
# Requires CUDA toolkit 8.0 and CuDNN v5. For other versions, see "Install from sources" below.
$ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.11.0rc2-cp35-cp35m-linux_x86_64.whl
It seems to be JIT-ing on the first run.
Now, my questions are:
You can't do that easily; this is why I suggest building TensorFlow yourself to target your specific compute architecture (see the official Dockerfile).
The JIT compilation is done lazily; if you rely on it, the best you can do is use CUDA_CACHE_PATH with a Docker volume to speed up further container launches.
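For example (a sketch; the volume name, mount path, and script name are arbitrary placeholders), the JIT cache can be persisted across container launches with a named volume:

```shell
# Create a named volume once; it survives container removal.
docker volume create cuda-cache

# Point the CUDA JIT cache at the volume. The first run pays the JIT cost;
# subsequent containers reusing the volume start without the delay.
nvidia-docker run -it \
  -e CUDA_CACHE_PATH=/cuda-cache \
  -v cuda-cache:/cuda-cache \
  tensorflow/tensorflow:0.10.0rc0-gpu python my_script.py
```

Note that the CUDA driver also caps the JIT cache size (see CUDA_CACHE_MAXSIZE), so a large model may need that raised as well.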
@achimnol: also, if you rely on TensorFlow operations that use custom CUDA code (instead of cuDNN), compiling directly for your GPU architecture could yield better performance than JIT-ing from a previous architecture (even excluding the JIT time).
@3XX0 Thanks for the answers. I finally fell back to rebuilding TensorFlow myself with customized CUDA compute capability options, and now the first-run delay is gone.
@flx42 Currently I don't have custom operators, but I will keep your note in mind when I do!