Serving: Docker with GPU failed call to cuInit: CUresult(-1)

Created on 26 Jul 2018 · 16 Comments · Source: tensorflow/serving

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 17.10
  • TensorFlow Serving installed from (source or binary): binary
  • TensorFlow Serving version: 1.9
  • Docker version: 18.03.1-ce
  • Nvidia docker version: 2.0.3

Describe the problem

I'm trying to run TensorFlow Serving in a container that needs a GPU.
When I start the container and use it, the process does not appear in nvidia-smi on the host.
Looking at the log, I saw a few odd errors.

Exact Steps to Reproduce

This is a simple example which shows the same error:

docker run -p 8501:8501 \
  -v /tmp/tfserving/serving/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_three:/models/half_plus_three \
  -e MODEL_NAME=half_plus_three -t tensorflow/serving:1.9.0-devel-gpu \
  tensorflow_model_server \
  --port=8500 \
  --rest_api_port=8501 \
  --model_name=half_plus_three \
  --model_base_path=/models/half_plus_three

Source code / logs

2018-07-26 05:55:57.044214: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-26 05:55:57.044874: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUresult(-1)
2018-07-26 05:55:57.045256: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program

All 16 comments

Likely the cuda stubs used in building are still there.

I worked around this for r1.9 by commenting them out and building with nvidia-docker.

A good way to test whether the GPU drivers in your container are set up correctly before you start building the model server is this script, which has no dependencies and should print the details of your video card: https://gist.github.com/f0k/63a664160d016a491b2cbea15913d549

Could you help me with the workaround? What exactly did you comment out?

Also, I'm not using the image as a build server but as a production server: it loads the necessary model and actually serves it.

I tried running the gist and got:

Traceback (most recent call last):
  File "cuda_check.py", line 117, in <module>
    sys.exit(main())
  File "cuda_check.py", line 52, in main
    raise OSError("could not load any of: " + ' '.join(libnames))
OSError: could not load any of: libcuda.so libcuda.dylib cuda.dll
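
The failing check comes down to two steps: load the CUDA driver library, then call cuInit. Here is a minimal sketch of that idea using Python's ctypes. The helper name cuda_init_status is hypothetical and this is not the gist's actual code; it only assumes that the driver library exports cuInit, which takes a flags argument of 0 and returns 0 on success.

```python
import ctypes


def cuda_init_status(libnames=("libcuda.so", "libcuda.so.1")):
    """Try to load the CUDA driver library and call cuInit(0).

    Returns (library_name, cuinit_result), or (None, None) when none of
    the candidate libraries can be loaded.
    """
    for name in libnames:
        try:
            lib = ctypes.CDLL(name)
        except OSError:
            # The dynamic loader could not find or open this candidate.
            continue
        # cuInit's flags argument must be 0; a result of 0 means CUDA_SUCCESS.
        return name, lib.cuInit(0)
    return None, None
```

Calling cuda_init_status() inside the container separates the two failure modes seen in this thread: (None, None) means no libcuda could be loaded at all, while a non-zero second value means a library was loaded but initialization failed.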

OK, so here is a link to the Dockerfile and the changes I made to create a working instance:

https://github.com/tensorflow/tensorflow/issues/19840

Maybe a double check: are you running in nvidia-docker, and can you run nvidia-smi?

nvidia-smi works in the container; the Python script doesn't.
I haven't tested the modified image yet. Shouldn't it be published? It seems to cause a lot of issues.

I tried with the modified image and it works! I mean, I'm getting OOM but I guess that's another issue.
Thank you!

docker run -p 8501:8501 -v

You need to use nvidia-docker to run the GPU build.
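
As a rough illustration of that advice, the sketch below builds the reproduction command with the NVIDIA runtime requested explicitly. It assumes nvidia-docker 2.x, where the GPU runtime is selected with docker run --runtime=nvidia. The function name gpu_serving_command and the example model directory are hypothetical; the image tag and server flags are copied from the original report.

```python
import shlex


def gpu_serving_command(model_dir, model_name,
                        image="tensorflow/serving:1.9.0-devel-gpu"):
    """Build a `docker run` argument list that requests the NVIDIA runtime."""
    return [
        "docker", "run",
        "--runtime=nvidia",  # assumption: nvidia-docker 2.x is installed on the host
        "-p", "8501:8501",
        "-v", f"{model_dir}:/models/{model_name}",
        "-e", f"MODEL_NAME={model_name}",
        "-t", image,
        "tensorflow_model_server",
        "--port=8500",
        "--rest_api_port=8501",
        f"--model_name={model_name}",
        f"--model_base_path=/models/{model_name}",
    ]


# Hypothetical host path; substitute the directory that holds your SavedModel.
cmd = gpu_serving_command("/tmp/models/half_plus_three", "half_plus_three")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without Docker; the list can be passed to subprocess.run(cmd) when you want to launch it.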

I spent the last day debugging the same error in a similar configuration (Ubuntu 16.04, TFServing 1.9, Tesla P100). The GPU worked fine in tensorflow/tensorflow. Running in tensorflow/serving:nightly-devel-gpu fixed the problem.

https://github.com/tensorflow/serving/commit/4cbac38c307ea11527d0e45a3b18fd41f1b67601#diff-5442e32f8ca43e5ee752e24804404913

This should be fixed in the next release (and is fixed in the master branch)

@gautamvasudevan

The following lines, which probably break the CUDA functionality in nvidia-docker, are still included in the next release (r1.10): https://raw.githubusercontent.com/tensorflow/serving/r1.10/tensorflow_serving/tools/docker/Dockerfile.devel-gpu

RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:${LD_LIBRARY_PATH}

That's intentional - they only apply to that command, and are needed so the stubs exist to build the binary.

But the binary can be built successfully without the stubs! And the stubs stop the container from using the GPU when running in nvidia-docker.
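
One way to check whether a running container is affected is to look for stub copies of libcuda on the loader path, since a directory listed in LD_LIBRARY_PATH is searched before the system library locations. This is a minimal sketch, assuming the stub directory is named stubs as in the Dockerfile lines quoted above; the helper names stub_dirs and stub_libcuda_files are hypothetical.

```python
import os


def stub_dirs(ld_library_path):
    """Return the LD_LIBRARY_PATH entries whose last component is 'stubs'."""
    entries = [e for e in ld_library_path.split(":") if e]
    return [e for e in entries if os.path.basename(e.rstrip("/")) == "stubs"]


def stub_libcuda_files(ld_library_path):
    """Return libcuda files (or symlinks) found in those stub directories."""
    hits = []
    for directory in stub_dirs(ld_library_path):
        for name in ("libcuda.so", "libcuda.so.1"):
            path = os.path.join(directory, name)
            if os.path.lexists(path):  # lexists also reports dangling symlinks
                hits.append(path)
    return hits
```

Inside the container, stub_libcuda_files(os.environ.get("LD_LIBRARY_PATH", "")) returning a non-empty list suggests the stub may be shadowing the real driver library.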

Running rm /usr/local/cuda/lib64/stubs/libcuda.so.1 fixed my problem.
