Serving: Docker with GPU failed call to cuInit: CUresult(-1)

Created on 26 Jul 2018 · 16 Comments · Source: tensorflow/serving

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 17.10
TensorFlow Serving installed from (source or binary): binary
TensorFlow Serving version: 1.9
Docker version: 18.03.1-ce
Nvidia docker version: 2.0.3

Describe the problem

I'm trying to run TensorFlow Serving in a container that needs a GPU.
When I start the container and send requests to it, the process does not show up in nvidia-smi on the host.
Looking at the log, I saw a few odd errors.

Exact Steps to Reproduce

This is a simple example which shows the same error
docker run -p 8501:8501 \
  -v /tmp/tfserving/serving/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_three:/models/half_plus_three \
  -e MODEL_NAME=half_plus_three -t tensorflow/serving:1.9.0-devel-gpu \
  tensorflow_model_server \
  --port=8500 \
  --rest_api_port=8501 \
  --model_name=half_plus_three \
  --model_base_path=/models/half_plus_three

Source code / logs

2018-07-26 05:55:57.044214: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-26 05:55:57.044874: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUresult(-1)
2018-07-26 05:55:57.045256: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program

Source

aclowkey

👍3

All 16 comments

The CUDA stubs used during the build are likely still in the image.

I worked around this for r1.9 by commenting them out and building with nvidia-docker.

rdwrt on 26 Jul 2018

A good way to test whether the GPU drivers in your container are set up correctly, before you start building the model server, is this script. It has no dependencies and should print the details of your video card: https://gist.github.com/f0k/63a664160d016a491b2cbea15913d549

rdwrt on 26 Jul 2018

👍2
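The check suggested in the comment above can be approximated with a few lines of Python. This is a minimal sketch, not the linked gist: the function name check_cuda and the loader parameter are hypothetical, introduced here so the logic can be exercised without a GPU. It assumes only that the driver library exports cuInit, which returns 0 on success.

```python
import ctypes

# Candidate library names; "libcuda.so.1" is the soname the driver normally installs.
LIBNAMES = ("libcuda.so.1", "libcuda.so", "libcuda.dylib", "cuda.dll")


def check_cuda(loader=ctypes.CDLL, libnames=LIBNAMES):
    """Return (ok, message) describing whether the CUDA driver initialises.

    `loader` is a hypothetical injection point (defaults to ctypes.CDLL)
    so the function can be tested without a real driver.
    """
    for name in libnames:
        try:
            cuda = loader(name)
        except OSError:
            continue  # this name could not be loaded, try the next one
        result = cuda.cuInit(0)  # cuInit takes a flags argument, which must be 0
        if result == 0:  # 0 is CUDA_SUCCESS
            return True, f"{name}: cuInit succeeded"
        return False, f"{name}: cuInit failed with CUresult({result})"
    return False, "could not load any of: " + " ".join(libnames)


if __name__ == "__main__":
    ok, message = check_cuda()
    print(message)
```

Run inside the container, a "could not load" message suggests the driver library is not visible at all, while a cuInit failure suggests that some libcuda was found but could not initialise.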

Could you help me with the workaround? What exactly did you comment out?

aclowkey on 26 Jul 2018

Also, I'm not using the image as a build server but as a production server: it loads the necessary model and actually serves it.

aclowkey on 26 Jul 2018

I tried running the gist and got:

Traceback (most recent call last):
  File "cuda_check.py", line 117, in <module>
    sys.exit(main())
  File "cuda_check.py", line 52, in main
    raise OSError("could not load any of: " + ' '.join(libnames))
OSError: could not load any of: libcuda.so libcuda.dylib cuda.dll

aclowkey on 26 Jul 2018

OK, here is a link to the Dockerfile and the changes I made to create a working instance:

https://github.com/tensorflow/tensorflow/issues/19840

rdwrt on 26 Jul 2018

Maybe a double check: are you running in nvidia-docker, and can you run nvidia-smi?

rdwrt on 26 Jul 2018

nvidia-smi works in the container; the Python script doesn't.
I haven't tested the modified image yet. Shouldn't it be published? It seems to cause a lot of issues.

I tried the modified image and it works! I'm now getting an OOM error, but I guess that's a separate issue.
Thank you!

aclowkey on 26 Jul 2018

docker run -p 8501:8501 -v

You need to use nvidia-docker to run the GPU build.

gautamvasudevan on 26 Jul 2018

👍1

I spent the last day debugging the same error in a similar configuration (Ubuntu 16.04, TFServing 1.9, Tesla P100). The GPU worked fine in tensorflow/tensorflow. Running in tensorflow/serving:nightly-devel-gpu fixed the problem.

https://github.com/tensorflow/serving/commit/4cbac38c307ea11527d0e45a3b18fd41f1b67601#diff-5442e32f8ca43e5ee752e24804404913

rydee on 26 Jul 2018

👍1

This should be fixed in the next release (and is fixed in the master branch)

gautamvasudevan on 31 Jul 2018

@gautamvasudevan

The following lines, which probably break the CUDA functionality in nvidia-docker, are still included in the next release (r1.10): https://raw.githubusercontent.com/tensorflow/serving/r1.10/tensorflow_serving/tools/docker/Dockerfile.devel-gpu

RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:${LD_LIBRARY_PATH}

rdwrt on 1 Aug 2018
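To see whether a running container is affected by the ENV line quoted above, one option is to look for a stubs directory on the library search path. The sketch below is an illustration under stated assumptions: the helper name stub_dirs is hypothetical, and it only inspects LD_LIBRARY_PATH, so it will not notice a stub library reached some other way.

```python
import os


def stub_dirs(ld_library_path):
    """Return the LD_LIBRARY_PATH entries whose last component is 'stubs'."""
    entries = [p for p in ld_library_path.split(":") if p]
    return [p for p in entries if os.path.basename(p.rstrip("/")) == "stubs"]


if __name__ == "__main__":
    # Report any stub directory visible to the dynamic loader via the environment.
    for path in stub_dirs(os.environ.get("LD_LIBRARY_PATH", "")):
        print("stub directory on LD_LIBRARY_PATH:", path)
```

With the Dockerfile lines quoted above in effect, this would be expected to report /usr/local/cuda/lib64/stubs.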

That's intentional - they only apply to that command, and are needed so the stubs exist to build the binary.

gautamvasudevan on 1 Aug 2018

But the binary can be built successfully without the stubs! And the stubs stop the container from using the GPU when running in nvidia-docker.

rdwrt on 1 Aug 2018

See a solution here: https://github.com/tensorflow/serving/issues/1031

rdwrt on 1 Aug 2018

Running rm /usr/local/cuda/lib64/stubs/libcuda.so.1 fixed my problem.

CLIsVeryOK on 13 Aug 2018

👍8 👎3 🎉2
