I am trying to make a Dockerfile that compiles mxnet using nvidia-docker, based on the nvidia/cuda image. mxnet uses the variable USE_CUDA_PATH in its make script to set the location of the CUDA driver; it seems to ignore LD_LIBRARY_PATH.
Usually you would set this to /usr/local/cuda/lib64/; the libcudnn.so library, for example, can indeed be found there.
In the nvidia/cuda Docker image, however, there is no libcuda.so in /usr/local/cuda/lib64; instead it seems to be located in /usr/local/nvidia/lib64/.
Funnily enough, when I ln -s this libcuda.so.1 into /usr/local/cuda/lib64, the build succeeds from within nvidia-docker run nvidia/cuda, but the same command gives a "-lcuda not found" error during "nvidia-docker build ...".
Is there a way to get libcuda.so into the /usr/local/cuda/lib64 directory during nvidia-docker build?
I made a workaround by using the stub libcuda.so during the build.
At runtime I copy the libraries from /usr/local/nvidia/lib64/ before calling mxnet from R.
Did I do this correctly? Are there alternative ways to do it?
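Concretely, the relevant part of my Dockerfile looks roughly like this (a sketch only; <MXNET_REPO>, the make flags, and the R command are placeholders, not my exact config):

```dockerfile
# Build mxnet against the stub libcuda.so shipped with the toolkit;
# the real driver is only mounted by nvidia-docker at run time.
RUN git clone <MXNET_REPO> /mxnet && cd /mxnet && \
    make USE_CUDA=1 USE_CUDNN=1 ADD_LDFLAGS="-L/usr/local/cuda/lib64/stubs"

# At run time, copy the mounted driver libraries next to the toolkit
# ones before calling mxnet from R (<R_COMMAND> is a placeholder).
CMD cp /usr/local/nvidia/lib64/libcuda.so* /usr/local/cuda/lib64/ && <R_COMMAND>
```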
Not sure why, but it looks like mxnet uses both the CUDA runtime API (libcudart.so) and the CUDA driver API (libcuda.so). libcudart.so is linked automatically by nvcc, so you're fine as far as the CUDA runtime is concerned.
Regarding the CUDA driver though, it will only be present in the container at runtime (in /usr/local/nvidia/lib64), so as you figured out, you will need to compile the code against the libcuda.so stub (/usr/local/cuda/lib64/stubs) when you build the container.
At runtime, you have two solutions:
- If your code locates libcuda.so through LD_LIBRARY_PATH, you have nothing to do, because the nvidia/cuda image sets it properly.
- If your code does not honor LD_LIBRARY_PATH, the easiest way is to execute ldconfig before your command: CMD ldconfig && <MXNET_COMMAND>
So after further review, we are missing the CUDA driver stubs in our CUDA images.
Not sure why; that's something we need to fix.
Thanks for the quick reply!
In the 7.5 image I only could find the cuda driver stubs at:
/usr/local/cuda-7.5/targets/x86_64-linux/lib/stubs/libcuda.so
I suppose there should be symbolic links at /usr/local/cuda etc.
I couldn't find any documentation on how to compile and then run code within the image; maybe it would be an idea to put that somewhere in the README.md file?
I will try out the CMD ldconfig && approach.
My bad, my image was corrupted; we do include it.
Compiling/running code is done through your Dockerfile (see the documentation).
In your case, I'm guessing it would look like this:
FROM nvidia/cuda:cudnn
RUN git clone <MXNET_REPO>
RUN sed <MXNET_CONFIG>
# Something along these lines
# ADD_LDFLAGS = -L /usr/local/cuda/lib64/stubs
# USE_CUDA = 1
# USE_CUDNN = 1
RUN make
CMD <MXNET_COMMAND>
Thanks for the helpful pointers!
The nvidia-docker wrapper works pretty great!
@3XX0, I am having a related problem. I use
FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04
but there is no libcuda.so file to be found anywhere. I searched:
sudo find /usr/ -name 'libcuda.so.1'
but no luck. Any idea what I am doing wrong? TensorFlow 1.0.0 used to import but just said it couldn't find the library; now 1.2.0 will not even import.
@ljstrnadiii is it during a docker build or docker run?
During a docker build, you can't use GPUs (nvidia-docker does nothing). But you can compile code against libcuda.so by using the stubs from the CUDA toolkit in /usr/local/cuda/lib64/stubs/
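To make that concrete, a hedged Dockerfile sketch (app.c and the install path are hypothetical, not from this thread):

```dockerfile
# During `docker build` no GPU and no real driver are available, so
# link a driver-API program against the stub; nvidia-docker mounts
# the real libcuda.so.1 only at `docker run` time.
# app.c is a hypothetical program using the CUDA driver API.
RUN gcc app.c -I/usr/local/cuda/include \
        -L/usr/local/cuda/lib64/stubs -lcuda -o /usr/local/bin/app

# Refresh the linker cache at run time so the mounted driver is found.
CMD ldconfig && app
```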
@flx42 ,
During a docker run. For now, I am working inside the docker image until I debug everything. When I removed a WORKDIR in the Dockerfile and rebuilt, the file was suddenly found here:
/usr/local/nvidia/lib64/libcuda.so.1
After exiting the GCP server and ssh'ing back in, I ran the same container again, and suddenly nvidia-smi does not even work and libcuda.so.1 is nowhere to be found.
I am pretty confused. I wish there was tighter integration between nvidia and tensorflow.
I really just want to be able to build an image to run tf apps.
EDIT: I guess I should start by calling nvidia-docker...
Yes, you need to use nvidia-docker run