Hello! I am using batching and am running tf-serving via the Docker container tensorflow/serving:1.13.0-gpu.
My batch.config file looks like this:
max_batch_size { value: 32 }
batch_timeout_micros { value : 0 }
num_batch_threads { value : 64 }
allowed_batch_sizes : 1
allowed_batch_sizes : 2
allowed_batch_sizes : 8
allowed_batch_sizes : 32
max_enqueued_batches { value : 100000000}
And I run everything by doing tensorflow_model_server --model_base_path=/models/object-detect --rest_api_port=8501 --port=8081 --enable_batching --batching_parameters_file=batch.config.
The GPU is a Tesla P100 and the system has 8 cores running tf 1.13.1.
When I send 1000 concurrent requests to the server, it takes around 35 seconds without batching. With batching, it takes nearly the same exact time - about 34.5 seconds.
I know that the batch.config file needs to be fine-tuned by hand, and I have experimented with it a lot and adjusted the values, but nothing seems to actually affect runtimes.
I saw some other posts mention that building tf-serving from source fixes the issue but it has not for me.
Any advice would be great!
I ran into the same problem and did a lot of testing, and I now get some benefit from batching: from under 200 images per GPU to nearly 500 images per GPU.
I guess you are probably posting 1 image per request (or some other kind of data). If so, setting batch_timeout_micros to 0 means the server will not wait for other requests to form a batch, and it will behave just the same as with no batching.
You can set batch_timeout_micros to a few milliseconds, e.g. batch_timeout_micros { value : 5000 } (which means waiting at most 5ms to merge later requests into a batch), and then fine-tune the other parameters.
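To make the effect of the timeout concrete, here is a toy model of time-window batching. It is only a sketch and not TF Serving's actual scheduler: the function name form_batches, the 1 ms arrival spacing, and the rule "a batch closes when it is full or when the timeout has elapsed since its first request" are all assumptions made for illustration.

```python
def form_batches(arrivals_us, timeout_us, max_batch_size):
    """Group request arrival times (in microseconds) into batches.

    Toy rule (an assumption, not TF Serving's real scheduler): a batch
    opens when its first request arrives and closes once it is full or
    once more than timeout_us has passed since it opened.
    """
    batches = []
    current = []
    for t in sorted(arrivals_us):
        window_expired = bool(current) and (t - current[0] > timeout_us)
        batch_full = len(current) >= max_batch_size
        if current and (window_expired or batch_full):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# Hypothetical traffic: 12 single-image requests arriving 1 ms apart.
arrivals = [i * 1000 for i in range(12)]

sizes_no_wait = [len(b) for b in form_batches(arrivals, 0, 32)]
sizes_5ms = [len(b) for b in form_batches(arrivals, 5000, 32)]

print("timeout 0 us:   ", sizes_no_wait)
print("timeout 5000 us:", sizes_5ms)
```

In this toy model a timeout of 0 gives every request its own batch, while a 5 ms window lets several requests share one. Real behaviour also depends on how busy the batch threads are, so treat this only as intuition.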
To fully use the GPU, you can form batches on the client side, which means sending 16 or 32 images per request. This is much more efficient than forming batches on the server side.
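A minimal sketch of client-side batching, assuming the REST predict endpoint with the row-oriented "instances" request format. The helper names chunk and build_predict_bodies are hypothetical, and the inputs are dummy one-element lists standing in for whatever tensor your model's signature actually expects.

```python
import json


def chunk(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def build_predict_bodies(inputs, batch_size=32):
    """Build one JSON request body per client-side batch.

    Each body uses the row format {"instances": [...]}, so a single
    request carries batch_size inputs instead of one.
    """
    return [json.dumps({"instances": batch})
            for batch in chunk(inputs, batch_size)]


# Placeholder inputs (assumption): 70 dummy "images".
inputs = [[float(i)] for i in range(70)]
bodies = build_predict_bodies(inputs, batch_size=32)

batch_sizes = [len(json.loads(body)["instances"]) for body in bodies]
print(batch_sizes)
```

Each body would then be POSTed to the server's predict URL on the REST port (8501 in the command above), for example with the requests library. The model name in that URL depends on how the server was started, so it is left out here.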
Here is a related post that may also help.
For CPU, start by setting batch_timeout_micros to 0, then experiment with batch_timeout_micros values in the 1-10 millisecond (1000-10000 microsecond) range.
Since your scenario uses a GPU, please try the approach below.
1) Temporarily set batch_timeout_micros to infinity while you tune max_batch_size to achieve the desired balance between throughput and average latency. Consider values in the hundreds or thousands.
2) For online serving, tune batch_timeout_micros to rein in tail latency. The idea is that batches normally get filled to max_batch_size, but occasionally when there is a lapse in incoming requests, to avoid introducing a latency spike it makes sense to process whatever's in the queue even if it represents an underfull batch. The best value for batch_timeout_micros is typically a few milliseconds, and depends on your context and goals. Zero is a value to consider; it works well for some workloads. (For bulk processing jobs, choose a large value, perhaps a few seconds, to ensure good throughput but not wait too long for the final (and likely underfull) batch.)
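The two steps above can be sketched as a small script that renders candidate batch.config files for a tuning sweep. Everything here beyond the parameter names is an assumption: the helper render_batch_config, the file names, the candidate values, the defaults for num_batch_threads and max_enqueued_batches, and the use of a 10 second timeout as a stand-in for "infinity". You would still have to restart the server with each file and measure throughput and latency yourself.

```python
# Assumption: 10 s is used as a practical stand-in for an "infinite" timeout.
LARGE_TIMEOUT_US = 10_000_000


def render_batch_config(max_batch_size, batch_timeout_micros,
                        num_batch_threads=8, max_enqueued_batches=100):
    """Return the text of a batching parameters file.

    The thread and queue defaults are placeholders, not recommendations.
    """
    return (
        f"max_batch_size {{ value: {max_batch_size} }}\n"
        f"batch_timeout_micros {{ value: {batch_timeout_micros} }}\n"
        f"num_batch_threads {{ value: {num_batch_threads} }}\n"
        f"max_enqueued_batches {{ value: {max_enqueued_batches} }}\n"
    )


# Step 1: hold the timeout very large and vary max_batch_size.
step1 = {
    f"batch_size_{n}.config": render_batch_config(n, LARGE_TIMEOUT_US)
    for n in (128, 256, 512)
}

# Step 2: fix the batch size picked in step 1 (hypothetical choice here)
# and vary the timeout to rein in tail latency.
chosen_batch_size = 256
step2 = {
    f"timeout_{t}.config": render_batch_config(chosen_batch_size, t)
    for t in (0, 1000, 5000)
}

print(step2["timeout_5000.config"])
```

Each generated text can be written to disk and passed to the server through --batching_parameters_file, as in the original command.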
Closing this issue as it is in "awaiting response" for 3 days. Feel free to add your comments and we will reopen.
Did you find any ideas to resolve this problem?
I have the same problem.
I tested my model, which predicts image embeddings. It takes only 0.12s for 50 images with batch size 50. But when I convert the Keras model to a TensorFlow SavedModel and serve it with TensorFlow Serving, it takes 3s to calculate the embeddings.
My batch.config file:
max_batch_size { value: 128 }
batch_timeout_micros { value: 3000 }
max_enqueued_batches { value: 1000000 }
num_batch_threads { value: 8 }