System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux ip-172-30-1-83 4.4.0-1062-aws #71-Ubuntu SMP Fri Jun 15 10:07:39 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 1.5.0
Python version: Python 2.7.12
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source): c++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CUDA/cuDNN version:
== cuda libs ===================================================
/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudart_static.a
/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudart.so.9.0.176
/usr/local/cuda-9.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-9.0/doc/man/man7/libcudart.7
But I also have these files:
/usr/lib/x86_64-linux-gnu/libcudnn.so -> /etc/alternatives/libcudnn_so
/usr/lib/x86_64-linux-gnu/libcudnn.so.6 -> libcudnn.so.6.0.21
/usr/lib/x86_64-linux-gnu/libcudnn.so.6.0.21
/usr/lib/x86_64-linux-gnu/libcudnn.so.7 -> libcudnn.so.7.0.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.0.5
GPU model and memory: Tesla K80
Describe the problem
I am trying to build tensorflow-model-server with GPU support, since the apt-get version is CPU-only.
I did:
git clone --recurse-submodules https://github.com/tensorflow/serving
bazel clean --expunge && export TF_NEED_CUDA=1
bazel query 'kind(rule, @local_config_cuda//...)'
And got:
Cuda Configuration Error: Cannot find cuda library libcudnn.so.6
When I do: bazel build -c opt --config=cuda tensorflow_serving/model_servers:tensorflow_model_server
I get the same error.
Try symlinks like
ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.6 /usr/local/cuda/lib64/libcudnn.so.6
ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so /usr/local/cuda/lib64/libcudnn.so
The process you are attempting is extremely messy and no one is intent on cleaning it up. You may be able to extract bits and pieces of useful information from the x86 docker install
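A minimal sketch of the symlink suggestion above, assuming the cuDNN libraries live in one directory and the CUDA configuration looks for them in another. `link_cudnn` is a hypothetical helper name, not part of CUDA or TensorFlow, and the example paths are simply the ones quoted in this thread.

```shell
# Sketch: link every libcudnn.so* found in SRC_DIR into DST_DIR.
# link_cudnn is a hypothetical helper, not a CUDA or TensorFlow tool.
link_cudnn() {
  local src_dir="$1" dst_dir="$2" lib found=0
  for lib in "$src_dir"/libcudnn.so*; do
    # Skip the unexpanded glob pattern when nothing matches.
    [ -e "$lib" ] || [ -L "$lib" ] || continue
    # -f replaces a stale link; -n avoids following an existing link to a directory.
    ln -sfn "$lib" "$dst_dir/$(basename "$lib")"
    found=1
  done
  # Return non-zero when no cuDNN library was found to link.
  [ "$found" -eq 1 ]
}

# Example call with the paths from this thread (may need sudo):
# link_cudnn /usr/lib/x86_64-linux-gnu /usr/local/cuda/lib64
```

Linking everything that matches `libcudnn.so*` covers both the unversioned name and the versioned names, so the same call works whichever cuDNN version the configuration step asks for.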
Thanks. This first problem was fixed. After that, I started the build by:
bazel build -c opt --config=cuda tensorflow_serving/model_servers:tensorflow_model_server
And the build failed. The tail of the log is this:
WARNING: /usr/deeplearning/tensorflow_serving/serving/tensorflow_serving/servables/tensorflow/BUILD:488:1: in cc_library rule //tensorflow_serving/servables/tensorflow:get_model_metadata_impl: target '//tensorflow_serving/servables/tensorflow:get_model_metadata_impl' depends on deprecated target '@org_tensorflow//tensorflow/contrib/session_bundle:session_bundle': No longer supported. Switch to SavedModel immediately.
INFO: Analysed target //tensorflow_serving/model_servers:tensorflow_model_server (131 packages loaded).
INFO: Found 1 target...
ERROR: /home/ubuntu/.cache/bazel/_bazel_ubuntu/03fd35feca1f46033f1118e2b4733251/external/com_github_libevent_libevent/BUILD.bazel:52:1: Executing genrule @com_github_libevent_libevent//:libevent-srcs failed (Exit 127)
./autogen.sh: 18: ./autogen.sh: aclocal: not found
Target //tensorflow_serving/model_servers:tensorflow_model_server failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 47.544s, Critical Path: 6.17s
INFO: 9 processes: 9 local.
FAILED: Build did NOT complete successfully
See the dependencies starting on line 27 of the docker example. It looks like you don't have automake.
Following the instructions in the Dockerfile to set up a build environment is your best bet.
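Before re-running the build, a quick check like the following could show which build tools are still missing from PATH. This is only a sketch: `missing_tools` is a hypothetical helper, `aclocal` is the command the failing genrule reported, and the other names in the example are assumed autotools companions rather than a list taken from the Dockerfile.

```shell
# Sketch: print each command from the argument list that is not on PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example: aclocal is the tool the failing genrule could not find;
# automake and autoconf are assumed companions, not a verified list.
# missing_tools aclocal automake autoconf
```

An empty result means every listed command was found; the authoritative dependency list remains the one in the Dockerfile.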
I gave the docker way a try.
I did:
sudo docker pull tensorflow/serving:latest-devel-gpu
sudo docker run -it -p 8500:8500 tensorflow/serving:latest-devel-gpu
And then inside I ran the tensorflow_model_server.
However, it writes:
2018-07-25 21:49:16.919411: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUresult(-1)
2018-07-25 21:49:16.919458: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:152] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
2018-07-25 21:49:17.102061: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:161] Restoring SavedModel bundle
Besides that, when I just start the container with sudo docker run -it -p 8500:8500 tensorflow/serving:latest-devel-gpu and type nvidia-smi, I get "command not found".
docker pull tensorflow/serving:latest-devel-gpu
docker run -it -p 8500:8500 tensorflow/serving:latest-devel
You pulled the latest devel GPU build, but then ran the latest-devel build without GPU support.
Try:
nvidia-docker run -it -p 8500:8500 tensorflow/serving:latest-devel-gpu
@gautamvasudevan,
sudo docker run -it -p 8500:8500 tensorflow/serving:latest-devel-gpu
2018-07-27 08:01:04.519852: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUresult(-1)
2018-07-27 08:01:04.519886: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: 03866238e29e
2018-07-27 08:01:04.519898: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: 03866238e29e
2018-07-27 08:01:04.520001: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
However, I did a performance test and it indeed runs on a GPU! So simply running
sudo docker run -it -p 8500:8500 tensorflow/serving:latest-devel-gpu
and then
tensorflow_model_server
makes it run on a GPU (and I don't actually need nvidia-docker), in spite of all the messages it prints about
failed call to cuInit
How can it be?
It would be nice if tensorflow_model_server simply printed whether the model is being served on a GPU or a CPU...
The development environment installs a binary to /usr/local/bin so you don't need to use bazel to run the binary.
TF Serving does let you know if it's using the GPU. You should see log output like the following:
2018-07-27 00:05:09.541989: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties:
name: Quadro K1200 major: 5 minor: 0 memoryClockRate(GHz): 1.0325
pciBusID: 0000:02:00.0
totalMemory: 3.92GiB freeMemory: 3.26GiB
2018-07-27 00:05:09.542021: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2018-07-27 00:07:20.727204: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-27 00:07:20.727231: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:958] 0
2018-07-27 00:07:20.727239: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: N
2018-07-27 00:07:20.727378: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2983 MB memory) -> physical GPU (device: 0, name: Quadro K1200, pci bus id: 0000:02:00.0, compute capability: 5.0)
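As a rough way to check for that output automatically, the sketch below searches a saved log for the "Created TensorFlow device" line shown above. It assumes the server's log output was redirected to a file; `log_shows_gpu` and the file name in the example are hypothetical.

```shell
# Sketch: succeed if a saved server log contains the line TensorFlow prints
# when it creates a GPU device. log_shows_gpu is a hypothetical helper name.
log_shows_gpu() {
  grep -q 'Created TensorFlow device (.*device:GPU:[0-9]' "$1"
}

# Example, assuming the server log was saved to server.log (hypothetical name):
# log_shows_gpu server.log && echo "serving on GPU" || echo "no GPU device in log"
```

The pattern matches only the device-creation line, so a log that contains nothing but the "failed call to cuInit" messages does not count as a GPU confirmation.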
It's difficult to debug your case without being able to easily follow the steps to reproduce it. You absolutely do need nvidia-docker to use the GPU from the container: plain Docker containers are hardware agnostic (they don't load the necessary kernel modules or user-level libraries) and cannot use your GPU. That's why you're seeing those errors.
You can learn more about it here:
https://devblogs.nvidia.com/nvidia-docker-gpu-server-application-deployment-made-easy/
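One way to see the difference from inside a container is to look for the NVIDIA device nodes, since the earlier log complained that /dev/nvidia0 does not exist. The following is a sketch only; `has_nvidia_device` is a hypothetical helper, and the directory argument exists just to make it easy to try out.

```shell
# Sketch: succeed if at least one NVIDIA device node (nvidia0, nvidia1, ...)
# exists under the given directory (default: /dev).
# has_nvidia_device is a hypothetical helper name.
has_nvidia_device() {
  local dev_dir="${1:-/dev}" node
  for node in "$dev_dir"/nvidia[0-9]*; do
    [ -e "$node" ] && return 0
  done
  return 1
}

# Example, run inside the container:
# has_nvidia_device && echo "GPU device nodes visible" || echo "no /dev/nvidia* nodes"
```

If the check fails in a container started with plain docker run but passes in one started with nvidia-docker run, the container runtime is the likely cause rather than the image.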
As I wrote,
Well - it finally worked!
Taking an already-built docker using:
docker pull tensorflow/serving:latest-devel-gpu
didn't work for me,
But when I built it myself from the Dockerfile (the build took 3 hours), it finally worked, with positive confirmation that a GPU is being used (and it really was faster):
sudo nvidia-docker build --pull -t $USER/tensorflow-serving-devel-gpu -f Dockerfile.devel-gpu .
The build process took a lot of disk space, so I actually had to increase the volume size, and re-run it, and then it worked.
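Since the build ran out of disk space the first time, a check like the one below could be run before starting it. This is a sketch under assumptions: `has_free_kb` is a hypothetical helper, and the thread does not say how much space the build needs, so the threshold in the example is a placeholder rather than a measured requirement.

```shell
# Sketch: succeed if the filesystem holding PATH has at least MIN_KB
# kilobytes available. has_free_kb is a hypothetical helper name.
has_free_kb() {
  local path="$1" min_kb="$2" avail_kb
  # df -P prints one POSIX-format line per filesystem; column 4 is "Available".
  avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge "$min_kb" ]
}

# Example; /var/lib/docker is Docker's default data directory on Linux and
# 20000000 KB is only a placeholder threshold:
# has_free_kb /var/lib/docker 20000000 || echo "free up space before building"
```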
Positive GPU confirmation:
2018-08-02 21:12:34.473101: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1485] Adding visible gpu devices: 0
2018-08-02 21:12:34.851668: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:966] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-02 21:12:34.851731: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:972] 0
2018-08-02 21:12:34.851748: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:985] 0: N
2018-08-02 21:12:34.852075: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1098] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10756 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Thanks @dsmiller, @gautamvasudevan for the assistance!
Awesome! Glad to hear. Docker images should be fixed with the next release.
FYI you can use docker pull tensorflow/serving:latest-gpu now, which is a serving image that doesn't require any building. See the updated Docker instructions for how to serve with a GPU more easily.
That's great news
Hi guys, I added some custom ops to TensorFlow, and to deploy the models we need to build a custom tensorflow_model_server. I forgot how to build this kind of tensorflow_model_server. Could anybody give some advice, please? Thanks.