Serving: Build tensorflow-model-server for gpu - Cannot find cuda library libcudnn.so.6

Created on 21 Jul 2018 · 14 Comments · Source: tensorflow/serving

System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux ip-172-30-1-83 4.4.0-1062-aws #71-Ubuntu SMP Fri Jun 15 10:07:39 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

TensorFlow installed from (source or binary): binary

TensorFlow version (use command below): 1.5.0

Python version: Python 2.7.12

Bazel version (if compiling from source):

GCC/Compiler version (if compiling from source): c++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609

CUDA/cuDNN version:

== cuda libs ===================================================
/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudart_static.a
/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudart.so.9.0.176
/usr/local/cuda-9.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-9.0/doc/man/man7/libcudart.7

But I also have these files:
/usr/lib/x86_64-linux-gnu/libcudnn.so -> /etc/alternatives/libcudnn_so
/usr/lib/x86_64-linux-gnu/libcudnn.so.6 -> libcudnn.so.6.0.21
/usr/lib/x86_64-linux-gnu/libcudnn.so.6.0.21
/usr/lib/x86_64-linux-gnu/libcudnn.so.7 -> libcudnn.so.7.0.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.0.5

GPU model and memory: Tesla K80
Describe the problem
I am trying to build tensorflow-model-server with GPU support, since the apt-get version appears to be CPU-only.

I did:

git clone --recurse-submodules https://github.com/tensorflow/serving
bazel clean --expunge && export TF_NEED_CUDA=1
bazel query 'kind(rule, @local_config_cuda//...)'
And got:

Cuda Configuration Error: Cannot find cuda library libcudnn.so.6

When I do: bazel build -c opt --config=cuda tensorflow_serving/model_servers:tensorflow_model_server
I get the same error.

ndvbd

All 14 comments

Try symlinks like

ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.6 /usr/local/cuda/lib64/libcudnn.so.6
ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so /usr/local/cuda/lib64/libcudnn.so

The process you are attempting is extremely messy and no one is intent on cleaning it up. You may be able to extract bits and pieces of useful information from the x86 docker install
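A slightly more defensive variant of the symlink commands above, as a rough sketch: it links a library only when the source file exists and nothing is already present at the destination. The function name `link_cudnn` is hypothetical, and the directories in the usage note are simply the ones mentioned in this thread; adjust them to your install.

```shell
# Hypothetical helper (not part of TensorFlow Serving): symlink one library
# from src_dir into dst_dir, skipping the step if it is already in place.
link_cudnn() {
  src_dir="$1"   # directory that holds the real library
  dst_dir="$2"   # directory where the build looks for it
  name="$3"      # file name, e.g. libcudnn.so.6

  if [ ! -e "$src_dir/$name" ]; then
    echo "missing: $src_dir/$name" >&2
    return 1
  fi
  if [ -e "$dst_dir/$name" ] || [ -L "$dst_dir/$name" ]; then
    echo "already present: $dst_dir/$name"
    return 0
  fi
  ln -s "$src_dir/$name" "$dst_dir/$name" && echo "linked: $dst_dir/$name"
}
```

Usage would look something like `link_cudnn /usr/lib/x86_64-linux-gnu /usr/local/cuda/lib64 libcudnn.so.6`, probably run with sudo since the destination is under /usr/local.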

dsmiller on 24 Jul 2018

👍2

Thanks. This first problem was fixed. After that, I started the build by:

bazel build -c opt --config=cuda tensorflow_serving/model_servers:tensorflow_model_server

And the build failed. The tail of the log is this:

WARNING: /usr/deeplearning/tensorflow_serving/serving/tensorflow_serving/servables/tensorflow/BUILD:488:1: in cc_library rule //tensorflow_serving/servables/tensorflow:get_model_metadata_impl: target '//tensorflow_serving/servables/tensorflow:get_model_metadata_impl' depends on deprecated target '@org_tensorflow//tensorflow/contrib/session_bundle:session_bundle': No longer supported. Switch to SavedModel immediately.
INFO: Analysed target //tensorflow_serving/model_servers:tensorflow_model_server (131 packages loaded).
INFO: Found 1 target...
ERROR: /home/ubuntu/.cache/bazel/_bazel_ubuntu/03fd35feca1f46033f1118e2b4733251/external/com_github_libevent_libevent/BUILD.bazel:52:1: Executing genrule @com_github_libevent_libevent//:libevent-srcs failed (Exit 127)
./autogen.sh: 18: ./autogen.sh: aclocal: not found
Target //tensorflow_serving/model_servers:tensorflow_model_server failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 47.544s, Critical Path: 6.17s
INFO: 9 processes: 9 local.
FAILED: Build did NOT complete successfully

ndvbd on 25 Jul 2018

See the dependencies starting on line 27 of the docker example. It looks like you don't have automake.
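The `aclocal: not found` error comes from a missing autotools program (aclocal ships with automake). A quick way to see which build tools are absent before starting a long bazel build might look like this sketch. The function name `missing_tools` is hypothetical, and the tool list in the usage note is an assumption, not the package list from the Dockerfile.

```shell
# Hypothetical helper: print each named tool that is not found on PATH,
# one per line. Prints nothing when every tool is available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

For example, `missing_tools aclocal automake autoconf libtoolize` lists whichever of those are not installed; an empty result means all of them were found.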

dsmiller on 25 Jul 2018

👍1

Following the instructions in the Dockerfile is your best bet: using the Docker instructions to set up a build environment is the easiest route.

gautamvasudevan on 25 Jul 2018

I gave the docker way a try.
I did:

sudo docker pull tensorflow/serving:latest-devel-gpu
sudo docker run -it -p 8500:8500 tensorflow/serving:latest-devel-gpu

And then inside I ran the tensorflow_model_server.
However, it writes:

2018-07-25 21:49:16.919411: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUresult(-1)
2018-07-25 21:49:16.919458: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:152] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
2018-07-25 21:49:17.102061: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:161] Restoring SavedModel bundle

Besides that, when I just get into the docker by: sudo docker run -it -p 8500:8500 tensorflow/serving:latest-devel-gpu, and type nvidia-smi, I get command not found.

ndvbd on 26 Jul 2018

docker pull tensorflow/serving:latest-devel-gpu
docker run -it -p 8500:8500 tensorflow/serving:latest-devel

You pulled the latest devel GPU build, but then ran the latest-devel build without GPU support.

Try:
nvidia-docker run -it -p 8500:8500 tensorflow/serving:latest-devel-gpu

gautamvasudevan on 26 Jul 2018

@gautamvasudevan,

That was a typo in the GitHub comment I wrote (I have fixed it). I did run:
sudo docker run -it -p 8500:8500 tensorflow/serving:latest-devel-gpu
and there is still no nvidia-smi available there.
Just to be clear, I ran tensorflow_model_server directly, without any bazel-bin. Is this okay, or must we run it via bazel-bin?
I didn't know there was such a thing as nvidia-docker, and it's not mentioned here. Indeed, if I run the container with nvidia-docker, the nvidia-smi command is available, but when I run tensorflow_model_server I still get:

2018-07-27 08:01:04.519852: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUresult(-1)
2018-07-27 08:01:04.519886: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: 03866238e29e
2018-07-27 08:01:04.519898: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: 03866238e29e
2018-07-27 08:01:04.520001: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program

However, I ran a performance test and it does seem to run on a GPU! So simply running
sudo docker run -it -p 8500:8500 tensorflow/serving:latest-devel-gpu
and then
tensorflow_model_server
makes it run on a GPU (so I don't actually need nvidia-docker), in spite of all the messages about
failed call to cuInit

How can that be?
It would be nice if tensorflow_model_server simply printed whether the model is being served on a GPU or a CPU...

ndvbd on 27 Jul 2018

The development environment installs a binary to /usr/local/bin so you don't need to use bazel to run the binary.

TF Serving does let you know if it's using the GPU. You should see log output like the following:

2018-07-27 00:05:09.541989: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties: 
name: Quadro K1200 major: 5 minor: 0 memoryClockRate(GHz): 1.0325
pciBusID: 0000:02:00.0
totalMemory: 3.92GiB freeMemory: 3.26GiB
2018-07-27 00:05:09.542021: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2018-07-27 00:07:20.727204: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-27 00:07:20.727231: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:958]      0 
2018-07-27 00:07:20.727239: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   N 
2018-07-27 00:07:20.727378: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2983 MB memory) -> physical GPU (device: 0, name: Quadro K1200, pci bus id: 0000:02:00.0, compute capability: 5.0)

It's difficult to debug your case without being able to easily follow the steps to reproduce it. You absolutely do need nvidia-docker to use the GPU with the container: Docker containers are hardware agnostic (they don't load the necessary kernel modules or user-level libraries) and cannot use your GPU on their own. That's why you're seeing those errors.

You can learn more about it here:

https://devblogs.nvidia.com/nvidia-docker-gpu-server-application-deployment-made-easy/
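One way to check from inside a container whether the GPU device nodes were passed through at all is to look for them under /dev, which is what the earlier "/dev/nvidia0 does not exist" message refers to. The sketch below assumes the usual /dev/nvidia0, /dev/nvidia1, ... naming; the function name `gpu_device_visible` is hypothetical.

```shell
# Hypothetical helper: return 0 if at least one NVIDIA device node
# (nvidia0, nvidia1, ...) exists under the given directory (default: /dev).
gpu_device_visible() {
  dev_dir="${1:-/dev}"
  for node in "$dev_dir"/nvidia[0-9]*; do
    # When the glob matches nothing, $node is the literal pattern and -e fails.
    [ -e "$node" ] && return 0
  done
  return 1
}
```

Running `gpu_device_visible && echo "GPU nodes present" || echo "no GPU nodes"` inside the container gives a quick answer before starting the server.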

gautamvasudevan on 27 Jul 2018

As I wrote, even if I use nvidia-docker, I get the same messages.
In practice, whether I use docker or nvidia-docker, the model runs at GPU speed. Something weird is happening there.

ndvbd on 31 Jul 2018

Well - it finally worked!

Using the prebuilt image from:
docker pull tensorflow/serving:latest-devel-gpu
didn't work for me.
But when I built the image myself from the Dockerfile (the build took 3 hours), it finally worked, giving positive confirmation that a GPU is being used (and it really was faster):

sudo nvidia-docker build --pull -t $USER/tensorflow-serving-devel-gpu -f Dockerfile.devel-gpu .
The build process took a lot of disk space, so I actually had to increase the volume size, and re-run it, and then it worked.
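Since the build can run out of disk space hours in, it may be worth checking the free space up front. This is only a sketch: the helper names are hypothetical, and how much space the build actually needs is not stated in this thread, so any threshold you pass is your own estimate.

```shell
# Hypothetical helper: print the space available (in KiB) on the
# filesystem that holds the given path, using POSIX df output.
available_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: return 0 if at least min_kb KiB are free at path.
has_free_space() {
  path="$1"
  min_kb="$2"
  [ "$(available_kb "$path")" -ge "$min_kb" ]
}
```

For instance, `has_free_space /var/lib/docker 52428800 || echo "less than 50 GiB free"` would warn before starting; both the path and the 50 GiB figure here are illustrative assumptions.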

Positive GPU confirmation:

2018-08-02 21:12:34.473101: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1485] Adding visible gpu devices: 0
2018-08-02 21:12:34.851668: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:966] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-02 21:12:34.851731: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:972]      0 
2018-08-02 21:12:34.851748: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:985] 0:   N 
2018-08-02 21:12:34.852075: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1098] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10756 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
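If the server output is saved to a file, the confirmation line above can be checked for automatically. A minimal sketch, assuming the log contains a "Created TensorFlow device" line in the format shown in this thread; the function name `log_reports_gpu` is hypothetical.

```shell
# Hypothetical helper: return 0 if the log file contains a line reporting
# that a TensorFlow GPU device was created, 1 otherwise.
log_reports_gpu() {
  grep -q 'Created TensorFlow device (.*device:GPU:[0-9]' "$1"
}
```

Something like `tensorflow_model_server ... 2>&1 | tee server.log` followed by `log_reports_gpu server.log && echo "serving on GPU"` would then answer the GPU-or-CPU question without reading the log by hand.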

Thanks @dsmiller, @gautamvasudevan for the assistance!

ndvbd on 2 Aug 2018

Awesome! Glad to hear. Docker images should be fixed with the next release.

gautamvasudevan on 3 Aug 2018

FYI you can use docker pull tensorflow/serving:latest-gpu now, which is a serving image that doesn't require any building. See the updated Docker instructions for how to serve with a GPU more easily.
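As a rough sketch of what serving with the prebuilt image might look like, the helper below only prints a docker command rather than running it. It assumes the nvidia runtime is installed (`--runtime=nvidia`), the REST port 8501, and the image's `/models/<name>` plus `MODEL_NAME` convention; the function name and both arguments are hypothetical, so check the Docker instructions before relying on it.

```shell
# Hypothetical helper: print (do not run) a docker command that serves a
# SavedModel directory with the prebuilt GPU image.
serving_gpu_cmd() {
  model_dir="$1"    # host directory holding the SavedModel versions
  model_name="$2"   # name the server should expose the model under
  printf 'docker run --runtime=nvidia -p 8501:8501 --mount type=bind,source=%s,target=/models/%s -e MODEL_NAME=%s -t tensorflow/serving:latest-gpu\n' \
    "$model_dir" "$model_name" "$model_name"
}
```

Calling `serving_gpu_cmd /path/to/my_model my_model` prints the command so it can be reviewed and adjusted before being run.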

gautamvasudevan on 11 Aug 2018

That's great news

ndvbd on 11 Aug 2018

Hi guys, I added some custom ops to TensorFlow, and to deploy the models we need to build a custom tensorflow_model_server. I forgot how to build this kind of tensorflow_model_server. Could anybody give some advice, please? Thanks.