I am trying to make a Dockerfile that compiles mxnet using nvidia-docker, based on the nvidia/cuda image. mxnet uses the variable USE_CUDA_PATH in its make script to set the location of the CUDA driver; it seems to ignore LD_LIBRARY_PATH.
Usually you would set this to /usr/local/cuda/lib64/; the libcudnn.so library, for example, can indeed be found there.
In the nvidia/cuda Docker image, however, there is no libcuda.so in /usr/local/cuda/lib64; instead it seems to be located in /usr/local/nvidia/lib64/.
Funnily enough, when I ln -s this libcuda.so.1 into /usr/local/cuda/lib64, the build succeeds from within nvidia-docker run nvidia/cuda, but the same command gives a "-lcuda not found" error during "nvidia-docker build ...".
Is there a way to get libcuda.so into the /usr/local/cuda/lib64 directory during nvidia-docker build?
I made a workaround by using the stub libcuda.so during the build.
At runtime I copy the libraries from /usr/local/nvidia/lib64/ before calling mxnet from R.
Did I do this correctly? Are there alternative ways to do it?
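Concretely, the relevant part of my Dockerfile looks roughly like this (a sketch only; <MXNET_REPO>, the make flags, and the R command are placeholders, not my exact config):

```dockerfile
# Build mxnet against the stub libcuda.so shipped with the toolkit;
# the real driver is only mounted by nvidia-docker at run time.
RUN git clone <MXNET_REPO> /mxnet && cd /mxnet && \
    make USE_CUDA=1 USE_CUDNN=1 ADD_LDFLAGS="-L/usr/local/cuda/lib64/stubs"

# At run time, copy the mounted driver libraries next to the toolkit
# ones before calling mxnet from R (<R_COMMAND> is a placeholder).
CMD cp /usr/local/nvidia/lib64/libcuda.so* /usr/local/cuda/lib64/ && <R_COMMAND>
```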
Not sure why, but it looks like mxnet uses both the CUDA runtime API (libcudart.so) and the CUDA driver API (libcuda.so). libcudart.so is linked automatically by nvcc, so you're fine as far as the CUDA runtime is concerned.
Regarding the CUDA driver though, it will only be present in the container at runtime (in /usr/local/nvidia/lib64), so as you figured out, you will need to compile the code against the libcuda.so stub (/usr/local/cuda/lib64/stubs) when you build the container.
At runtime, you have two solutions:
- If your code locates libcuda.so through LD_LIBRARY_PATH, you have nothing to do, because the nvidia/cuda image sets it properly.
- If your code does not honor LD_LIBRARY_PATH, the easiest way is to execute ldconfig before your command: CMD ldconfig && <MXNET_COMMAND>
So after further review, we are missing the CUDA driver stubs in our CUDA images.
Not sure why; that's something we need to fix.
Thanks for the quick reply!
In the 7.5 image I only could find the cuda driver stubs at:
/usr/local/cuda-7.5/targets/x86_64-linux/lib/stubs/libcuda.so
I suppose there should be symbolic links at /usr/local/cuda etc.
I couldn't find any documentation on how to compile and then run code within the image; maybe it would be an idea to put that somewhere in the README.md file?
I will try out the CMD ldconfig && approach.
My bad, my image was corrupted; we do include it.
Compiling/running code is done through your Dockerfile (see the documentation).
In your case, I'm guessing it would look like this:
FROM nvidia/cuda:cudnn
RUN git clone <MXNET_REPO>
RUN sed <MXNET_CONFIG>
# Something along these lines
# ADD_LDFLAGS = -L /usr/local/cuda/lib64/stubs
# USE_CUDA = 1
# USE_CUDNN = 1
RUN make
CMD <MXNET_COMMAND>
Thanks for the helpful pointers!
The nvidia-docker wrapper works pretty great!
@3XX0, I am having a related problem. I use
FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04
but there is no libcuda.so file to be found anywhere. I searched:
sudo find /usr/ -name 'libcuda.so.1'
but no luck. Any idea what I am doing wrong? TensorFlow 1.0.0 used to import but just said it couldn't find the library; now 1.2.0 will not even import.
@ljstrnadiii is it during a docker build or docker run?
During a docker build, you can't use GPUs (nvidia-docker does nothing). But you can compile code against libcuda.so by using the stubs from the CUDA toolkit in /usr/local/cuda/lib64/stubs/
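To make that concrete, a hedged Dockerfile sketch (app.c and the install path are hypothetical, not from this thread):

```dockerfile
# During `docker build` no GPU and no real driver are available, so
# link a driver-API program against the stub; nvidia-docker mounts
# the real libcuda.so.1 only at `docker run` time.
# app.c is a hypothetical program using the CUDA driver API.
RUN gcc app.c -I/usr/local/cuda/include \
        -L/usr/local/cuda/lib64/stubs -lcuda -o /usr/local/bin/app

# Refresh the linker cache at run time so the mounted driver is found.
CMD ldconfig && app
```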
@flx42 ,
During a docker run. For now, I am working inside the docker image until I debug everything. When I removed a WORKDIR in the Dockerfile and rebuilt, the file was suddenly found here:
/usr/local/nvidia/lib64/libcuda.so.1
After exiting the GCP server and ssh'ing back in, I ran the same container again, and suddenly nvidia-smi does not even work and libcuda.so.1 is nowhere to be found.
I am pretty confused. I wish there was tighter integration between nvidia and tensorflow.
I really just want to be able to build an image to run tf apps.
EDIT: I guess I should start by calling nvidia-docker...
Yes, you need to use nvidia-docker run