I successfully installed tf-serving with
bazel build -c opt --config=cuda tensorflow_serving/...
But when I launch a tf-serving model, the server allocates nearly all available memory on all 4 GPUs.
bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=inception_gpu --model_base_path=./tf_servables/inception/inception_gpu
Log
2018-02-23 16:58:28.822617: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1208] Found device 3 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:83:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-02-23 16:58:28.825720: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1223] Device peer to peer matrix
2018-02-23 16:58:28.825818: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1229] DMA: 0 1 2 3
2018-02-23 16:58:28.825830: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1239] 0: Y Y N N
2018-02-23 16:58:28.825838: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1239] 1: Y Y N N
2018-02-23 16:58:28.825846: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1239] 2: N N Y Y
2018-02-23 16:58:28.825853: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1239] 3: N N Y Y
2018-02-23 16:58:28.825866: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1308] Adding visible gpu devices: 0, 1, 2, 3
2018-02-23 16:58:31.573712: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15130 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:02:00.0, compute capability: 6.0)
2018-02-23 16:58:31.881263: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 8561 MB memory) -> physical GPU (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:03:00.0, compute capability: 6.0)
2018-02-23 16:58:32.083473: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 15130 MB memory) -> physical GPU (device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0, compute capability: 6.0)
2018-02-23 16:58:32.374781: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 15130 MB memory) -> physical GPU (device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0, compute capability: 6.0)
2018-02-23 16:58:32.763353: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:159] Restoring SavedModel bundle.
2018-02-23 16:58:33.022217: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:194] Running LegacyInitOp on SavedModel bundle.
2018-02-23 16:58:33.105502: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:289] SavedModel load for tags { serve }; Status: success. Took 9150846 microseconds.
2018-02-23 16:58:33.105750: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: inception_gpu version: 1}
2018-02-23 16:58:33.122060: I tensorflow_serving/model_servers/main.cc:280] Running ModelServer at 0.0.0.0:9000 ...
Checking with nvidia-smi confirms it:
Fri Feb 23 17:04:47 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:02:00.0 Off | 0 |
| N/A 30C P0 30W / 250W | 15481MiB / 16276MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... Off | 00000000:03:00.0 Off | 0 |
| N/A 31C P0 31W / 250W | 15826MiB / 16276MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-PCIE... Off | 00000000:82:00.0 Off | 0 |
| N/A 30C P0 31W / 250W | 15481MiB / 16276MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-PCIE... Off | 00000000:83:00.0 Off | 0 |
| N/A 30C P0 32W / 250W | 15481MiB / 16276MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 18242 C ...g/model_servers/tensorflow_model_server 15471MiB |
| 1 15809 C python 6915MiB |
| 1 18242 C ...g/model_servers/tensorflow_model_server 8901MiB |
| 2 18242 C ...g/model_servers/tensorflow_model_server 15471MiB |
| 3 18242 C ...g/model_servers/tensorflow_model_server 15471MiB |
+-----------------------------------------------------------------------------+
(PID 15809 is an unrelated process and can be ignored.)
I exported the tf-serving model following the Inception instructions and tried to restrict it to only the 4th GPU (for example) via gpu_options:
# Restore variables from training checkpoint.
variable_averages = tf.train.ExponentialMovingAverage(
    inception_model.MOVING_AVERAGE_DECAY)
variables_to_restore = variable_averages.variables_to_restore()
saver = tf.train.Saver(variables_to_restore)
config = tf.ConfigProto(
    device_count={'GPU': 1},
    gpu_options={
        'allow_growth': True,
        # 'per_process_gpu_memory_fraction': 0.01,
        'visible_device_list': "3",
    },
    allow_soft_placement=True,
    log_device_placement=False
)
with tf.Session(config=config) as sess:
    # Restore variables from training checkpoints.
    ckpt = tf.train.get_checkpoint_state(FLAGS.checkpoint_dir)
Log
2018-02-23 17:31:29.760189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:83:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-02-23 17:31:29.760217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0, compute capability: 6.0)
Successfully loaded model from /home/tonychou.zyb/Tensorflow/ModelZoo/tf_checkpoints/inception/20160301/model.ckpt-157585 at step=157585.
I also tried
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
Neither works.
Could anyone help me figure this out? I would like to be able to specify both which GPU is used and how much GPU memory is allocated.
Thanks in advance!
The only way at the moment is to launch multiple Docker containers, one per GPU, limiting the scope so that each container sees only one GPU. Here is an example for a machine with 4 GPUs:
NV_GPU=0 nvidia-docker run -d -v /tmp/data:/data -p 9100:9100 --name tensorflow_serving_gpu0 -it c0a8204d1ead
NV_GPU=1 nvidia-docker run -d -v /tmp/data:/data -p 9101:9100 --name tensorflow_serving_gpu1 -it c0a8204d1ead
NV_GPU=2 nvidia-docker run -d -v /tmp/data:/data -p 9102:9100 --name tensorflow_serving_gpu2 -it c0a8204d1ead
NV_GPU=3 nvidia-docker run -d -v /tmp/data:/data -p 9103:9100 --name tensorflow_serving_gpu3 -it c0a8204d1ead
@vitalyli I think it is not very flexible. Thanks anyway.
@TonyChouZJU To restrict GPU usage you need to set flags and write the configuration for tensorflow_model_server itself. I recommend looking at #836, which may answer your question. If you want to choose which GPU serves the model, just add CUDA_VISIBLE_DEVICES=0 at the beginning of your command. Options like per_process_gpu_memory_fraction, allow_growth and allow_soft_placement should be written in the platform_config_file. You set them in the export script, so they only took effect during export, not during serving. The final command to run the server should look like:
CUDA_VISIBLE_DEVICES=0 bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=inception_gpu --model_base_path=./tf_servables/inception/inception_gpu --platform_config_file=platform_session.conf
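For anyone wondering what platform_session.conf might contain, here is a minimal sketch that generates it from Python so the values are easy to change. The message nesting (platform_configs, source_adapter_config, SavedModelBundleSourceAdapterConfig, legacy_config, session_config) is my recollection of the TF Serving protos from around this release, so treat it as an assumption and check it against your version. The helper name render_platform_config and the 0.4 fraction are purely illustrative.

```python
def render_platform_config(memory_fraction=0.4, allow_growth=True):
    """Return a text-format platform config that limits GPU memory at serving time.

    Assumed layout: PlatformConfigMap -> SavedModelBundleSourceAdapterConfig
    -> legacy_config (SessionBundleConfig) -> session_config (ConfigProto).
    """
    lines = [
        "platform_configs {",
        '  key: "tensorflow"',
        "  value {",
        "    source_adapter_config {",
        "      [type.googleapis.com/tensorflow.serving.SavedModelBundleSourceAdapterConfig] {",
        "        legacy_config {",
        "          session_config {",
        "            allow_soft_placement: true",
        "            gpu_options {",
        f"              per_process_gpu_memory_fraction: {memory_fraction}",
        f"              allow_growth: {'true' if allow_growth else 'false'}",
        "            }",
        "          }",
        "        }",
        "      }",
        "    }",
        "  }",
        "}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the config; redirect the output into platform_session.conf.
    print(render_platform_config(), end="")
```

Save the output as platform_session.conf and pass it with --platform_config_file as in the command above.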
@Li-Shu14 Hi, do you have any experience running a separate TensorFlow Serving server for each GPU?
I have a machine with two 1080 Tis. My TF-Serving correctly identifies both GPUs when CUDA_VISIBLE_DEVICES is not set. I was trying to run one TF-Serving server per GPU, so I opened two terminals and ran the following two commands:
CUDA_VISIBLE_DEVICES=0 bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=inception --model_base_path=/path/to/inception_model
CUDA_VISIBLE_DEVICES=1 bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9001 --model_name=mnist --model_base_path=/path/to/mnist_model
The first command ran fine, but the second one failed with this error:
terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
[1] 4021 abort (core dumped) CUDA_VISIBLE_DEVICES=1 --port=9001 --model_name=mnist
Any ideas or suggestions? Thanks in advance!
@sugartom Running a separate server for each GPU is feasible in my experience, but I have never encountered the error you describe. Does the error also occur when the two commands are run in the opposite order? If so, meaning you can only run one server at a time, then it is likely a hardware problem. Maybe "Resource temporarily unavailable" has something to do with the number of CPU cores (or some setting that limits their use)? Sorry, I can't think of any other ideas for your problem.
@Li-Shu14 , hi, thanks for your reply!
I did try running those two commands in the opposite order, that is, starting one server on gpu:1 first and then another server on gpu:0. Again, the first server (on gpu:1) runs normally, but the second server (on gpu:0) fails. So I don't think the order is the reason for the failure. I will check whether the number of CPU cores is the reason. And thanks again for your suggestions! :-)
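The one-server-per-GPU setup discussed above can also be scripted. A rough sketch, assuming the binary path and flags from the commands in this thread; build_server_launch is a hypothetical helper, and it only prepares the environment and argument list, so nothing starts until you hand them to subprocess.Popen. It does not address the std::system_error itself.

```python
import os

# Binary path as used in the commands earlier in this thread.
SERVER_BIN = "bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server"


def build_server_launch(gpu_index, port, model_name, model_base_path, base_env=None):
    """Return (env, argv) for one model server pinned to a single GPU."""
    env = dict(os.environ if base_env is None else base_env)
    # Number GPUs by PCI bus ID so the index matches nvidia-smi.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # The server process will only see this one device.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    argv = [
        SERVER_BIN,
        f"--port={port}",
        f"--model_name={model_name}",
        f"--model_base_path={model_base_path}",
    ]
    return env, argv


if __name__ == "__main__":
    servers = [
        (0, 9000, "inception", "/path/to/inception_model"),
        (1, 9001, "mnist", "/path/to/mnist_model"),
    ]
    for gpu, port, name, path in servers:
        env, argv = build_server_launch(gpu, port, name, path)
        print(f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']}", " ".join(argv))
        # To actually start the server: subprocess.Popen(argv, env=env)
```

Each server gets its own copy of the environment, so the GPU restriction of one process cannot leak into the other.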
@TonyChouZJU - Hi, is this still an issue? If not, please feel free to close this. Thanks!
I tried both approaches from this document to expose only one GPU to Docker, but somehow all 6 of my GPUs are still being used:
docker run --runtime=nvidia -p 8500:8500 \
--device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 \
--mount type=bind,source=/home/eee/T2T_Model/t2t_train/translate_enzh_wmt32k/transformer-transformer_base/export/,target=/models/my_model \
--mount type=bind,source=/home/eee/T2T_Model/batching.conf,target=/models/batching.conf \
-e MODEL_NAME=my_model -t tensorflow/serving:latest-gpu --batching_parameters_file=/models/batching.conf --enable_batching
NV_GPU=0 docker run --runtime=nvidia -p 8500:8500 \
--mount type=bind,source=/home/eee/T2T_Model/t2t_train/translate_enzh_wmt32k/transformer-transformer_base/export/,target=/models/my_model \
--mount type=bind,source=/home/eee/T2T_Model/batching.conf,target=/models/batching.conf \
-e MODEL_NAME=my_model -t tensorflow/serving:latest-gpu --batching_parameters_file=/models/batching.conf --enable_batching
Am I missing something?
SOLVED
This command solved my problem:
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -p 8500:8500 \
--mount type=bind,source=/home/eee/T2T_Model/t2t_train/translate_enzh_wmt32k/transformer-transformer_base/export/,target=/models/my_model \
--mount type=bind,source=/home/eee/T2T_Model/batching.conf,target=/models/batching.conf \
-e MODEL_NAME=my_model -t tensorflow/serving:latest-gpu --batching_parameters_file=/models/batching.conf --enable_batching
@vitalyli I'm trying to do the same thing as you by launching multiple Docker containers of the same image. However, I can only get one container to do inference; the rest of my clients give me this error:
error: <_Rendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "OS Error"
debug_error_string = "{"created":"@1547479281.497664676","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1036,"grpc_message":"OS Error","grpc_status":14}"
These are the commands I use to launch the Docker containers:
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -p 8500:8500 \
--mount type=bind,source=/T2T_Model/t2t_train/translate_enzh_wmt32k/transformer-transformer_base/export/,target=/models/my_model \
--mount type=bind,source=/T2T_Model/batching.conf,target=/models/batching.conf \
-e MODEL_NAME=my_model -t tensorflow/serving:latest-gpu --batching_parameters_file=/models/batching.conf --enable_batching
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 -p 8510:8510 \
--mount type=bind,source=/T2T_Model/t2t_train/translate_enzh_wmt32k/transformer-transformer_base/export/,target=/models/my_model \
--mount type=bind,source=/T2T_Model/batching.conf,target=/models/batching.conf \
-e MODEL_NAME=my_model -t tensorflow/serving:latest-gpu --batching_parameters_file=/models/batching.conf --enable_batching
These are the commands I use to launch my TF clients for inference:
t2t-query-server \
--server=0.0.0.0:8500 \
--servable_name=my_model \
--problem=translate_enzh_wmt32k \
--data_dir=/T2T_Model/t2t_data/1 \
--timeout_secs=30
t2t-query-server \
--server=0.0.0.0:8510 \
--servable_name=my_model \
--problem=translate_enzh_wmt32k \
--data_dir=/T2T_Model/t2t_data/1 \
--timeout_secs=30
What could I possibly be doing wrong?
UPDATE:
The problem seems to be that the gRPC server listens on 0.0.0.0:8500 inside every container, even though I published a different port with docker run -p 8510:8510:
tensorflow_serving/model_servers/server.cc:286] Running gRPC ModelServer at 0.0.0.0:8500 ...
SOLVED:
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -p 8510:8500
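To spell out why that works: -p takes host_port:container_port, and the server inside every container listens for gRPC on 8500 (per the log line above), so only the host side should vary. A small sketch that prints one command per GPU under that assumption; docker_run_command, the host ports (8500 plus 10 per GPU) and the /path/to/export mount are illustrative placeholders, not the actual setup from this thread.

```python
# Port the model server listens on inside each container (from the log above).
CONTAINER_GRPC_PORT = 8500


def docker_run_command(gpu_index, host_port, model_dir,
                       model_name="my_model",
                       image="tensorflow/serving:latest-gpu"):
    """Build a docker run command exposing one GPU and one distinct host port."""
    return " ".join([
        "docker run --runtime=nvidia",
        f"-e NVIDIA_VISIBLE_DEVICES={gpu_index}",
        # Host port varies per container; container port stays fixed.
        f"-p {host_port}:{CONTAINER_GRPC_PORT}",
        f"--mount type=bind,source={model_dir},target=/models/{model_name}",
        f"-e MODEL_NAME={model_name}",
        f"-t {image}",
    ])


if __name__ == "__main__":
    for gpu in range(2):
        print(docker_run_command(gpu, 8500 + 10 * gpu, "/path/to/export"))
```

Clients then connect to the host-side port, for example --server=0.0.0.0:8510 for the second container.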
Closing this as it has been in "awaiting response" status for more than a week. Feel free to add comments (if any) and we will reopen the issue. Thanks!