Serving: Latency issue while doing parallel calls to TF-serving

Created on 21 Mar 2019 · 16Comments · Source: tensorflow/serving

My TF-serving pipeline runs SIX TF-serving image models of different architecture _(like xception, densenet, nasnet, UNET etc)_ in parallel.

My time difference for 10 different images running through the pipeline in sequential order:


- Time for batch size of 1 took: 3.687 sec.
- Time for batch size of 1 took: 3.054 sec.
- Time for batch size of 1 took: 3.270 sec.
- Time for batch size of 1 took: 13.055 sec.
- Time for batch size of 1 took: 2.041 sec.
- Time for batch size of 1 took: 1.421 sec.
- Time for batch size of 1 took: 2.325 sec.
- Time for batch size of 1 took: 1.521 sec.
- Time for batch size of 1 took: 12.388 sec.
- Time for batch size of 1 took: 1.781 sec.

I don't understand why is it not following the constant graph of 1-2 sec in all the 10-requests, its returning prediction for 3-4 request very efficiently(i.e within 1-2sec).
Then suddenly for forthcoming 2-3 requests, it is peaking the latency to ~10sec(which is very bad for my application).

I am using enable_warmup_flag as to start up the TF-serving. Here is my command to start the TF-serving server:

tensorflow_model_server --port=8001 --model_config_file=/serving_models/config_files/product_type.conf --enable_batching=true --enable_model_warmup=true --per_process_gpu_memory_fraction=0.9

Relevant Information:

TensorFlow ModelServer: 1.10.0-dev
TensorFlow Library: 1.10.0

Python 3.6.6

tensorboard==1.9.0
tensorflow==1.9.0
tensorflow-gpu==1.10.1
tensorflow-serving-api==1.13.0
tensorflow-tensorboard==1.5.1
grpc-google-iam-v1==0.11.4
grpcio==1.13.0
celery==4.2.1

CUDA

libcudnn: 7.0.5
CUDA: 9.0
NCCL: 2.1.15
GCS GPUs: K80 (12GB RAM)

OS:

Distributor ID: Ubuntu
Description: Ubuntu 16.04.6 LTS
Release: 16.04
Codename: xenial

EXTRA:

On individual GPU of 12 GB RAM, I am loading ~25 TF-serving frozen graph model of size ~100MB.
Using CELERY with DJANGO to manage the parallelization of the calls in the pipeline.
IDEALLY, it should return prediction for all the 10 sequential requests within 10sec.

Any help would be really appreciated.

bug

Source

gr8Adakron

Most helpful comment

You can try the Tensor RT inference server from NVIDIA. It's more efficient and scalable and 3x faster than TF-Serving.

gr8Adakron on 7 Jan 2020

👍2

All 16 comments

@netfs @ewilderj @montanaflynn @lamberta @kchodorow please help!

gr8Adakron on 21 Mar 2019

Sorry,I want to ask some questions about how can I make the server to process parallel requests?Do I need run several images in the docker or just one image is ok ? I cant find anything about this in the doc of TF-sering.Any help is appreciate.Thx!

SefaZeng on 27 Mar 2019

@SefaZeng thanks for responding.

30 concurrent request (i.e 30 different images), where each request must be holding an individual image.
Something like this
github-answer2

So this is whole architecture, now the real problem is:

For image-1 it takes around 3-5secs
For image-2 it takes the same
but for image-3 and image-4 it will go around ~10 sec

Ideally, it should return me the outputs of all the concurrent 30-requests (i.e 150 requests = 30 images X 5 calls ) under ~10sec. But in my case, it goes above ~30-40 sec (at collective acknowledgment).

In short, 150 parallel calls (30 images X 5 request) to TF-serving must be treated in the same as it treats the single request, I guess.

As TF-serving has the capacity to handle the parallel requests and my warmup request flag is also set. On the contrary, I am making calls to the same sets of models, so loading and unloading of the model-graph from GPU RAM or migrating model load between CPU-GPU RAM doesn't seem to be feasible.

I am not sure what the real problem with the throttling of latency.
Any help truly appreciated. Thank you!

gr8Adakron on 27 Mar 2019

same issue, waiting for help.

songboyu on 30 Mar 2019

👍1

@netfs @ewilderj @montanaflynn @lamberta @kchodorow

songboyu on 30 Mar 2019

👍1

Are you sure you've set up warmup correctly? Please understand model warmup is not magic

it can't produce out of thin air the requests that TFS needs to use to warm up your model. You need to provide it by including the PredictionLog data you want replayed back to the server in assets.extra.

If you haven't done this, here is a guide on how to do it. If you have, please let me know so that I can take a closer look at it.

Thanks

misterpeddy on 2 Apr 2019

@unclepeddy
hey, so I have added the TFrecords in the extra.assets still not sure where to use that script?
Finished reading warmup data for model at /home/serving/pattern_resnet/serving/1/assets.extra/tf_serving_warmup_requests . Number of warmup records read: 150.
Please, just a little guidance, the documentation doesn't specify much detail.

gr8Adakron on 6 Apr 2019

Nope, as the logs indicate, you've done it correctly. The script is for producing the predictionlogs which you already seem to have...
So now you're warming up model 1 correctly. Do this for all models to ensure they're all warmed up.

misterpeddy on 11 Apr 2019

Thanks alot.

gr8Adakron on 12 Apr 2019

@ewilderj @kchodorow @lamberta @jart @unclepeddy
Still, the output is not consistent, if I am calling(inferencing with TF-serving) 8 models parallelly for one single image then this is the 5-iteration time sequence:

Iteration-1 for image_1.jpg
```
- xception ............ [0.063sec]
- densenet ............ [0.095sec]
- mobilenet ............ [0.045sec]
- mxnet152 ............ [0.090sec]
- mxnet152places ............ [8.918sec]
- segmentation_1........... [0.146sec]
- segmentation_2............ [0.147sec]
- human_classification ............ [8.826sec]

Iteration-2 for image_2.jpg
 ```
        - xception ............ [0.080sec]
         - densenet ............ [0.102sec]
         - mobilenet ............ [0.026sec]
         - mxnet152 ............ [0.150sec]
         - mxnet152places ............ [0.174sec]
         - segmentation_1............ [0.055sec]
         - segmentation_2............ [0.172sec]
         - human_classification ............ [0.056sec]

Iteration-3 for image_3.jpg
```
- xception ............ [8.933sec]
- densenet ............ [0.058sec]
- mobilenet ............ [8.866sec]
- mxnet152 ............ [0.146sec]
- mxnet152places ............ [0.169sec]
- segmentation_1............ [0.062sec]
- segmentation_2............ [0.172sec]
- human_classification ............ [0.056sec]

Iteration-4 for image_4.jpg
 ```
       - xception ............ [0.057sec]
         - densenet ............ [0.116sec]
         - mobilenet ............ [0.097sec]
         - mxnet152 ............ [0.161sec]
         - mxnet152places ............ [0.089sec]
         - segmentation_1............ [0.050sec]
         - segmentation_2............ [0.164sec]
         - human_classification ............ [8.632sec]

Iteration-5 for image_5.jpg
```
- xception ............ [8.789sec]
- densenet ............ [0.112sec]
- mobilenet ............ [0.108sec]
- mxnet152 ............ [8.826sec]
- mxnet152places ............ [0.095sec]
- segmentation_1............ [0.059sec]
- segmentation_2............ [8.883sec]
- human_classification ............ [0.056sec]

```

The best iteration was the iteration-2, which is the ideal case, but its so rare, somehow in all the iteration one or other model goes beyond 8sec which defeats the whole purpose of utilizing TF-serving in deployment.

I have 4 K80-GPUs and all these 8-models have been distributed across all the 4-GPUs, I mean:

xception, densenet and mobilenet are been hosted on GPU number 1 using TF-serving.
mxnet152, mxnet152places are been hosted on GPU number 2 using TF-serving.
human_classification is been hosted on GPU number 3 using TF-serving.
segmentation_1 and segmentation_2 are been hosted on GPU number 4 using TF-serving.

These models aren't even huge, all of them are around the size of ~100-150 MB.
My system configuration and process I have already mentioned above comments.

Trying to figure the optimization using TF-serving for months, unsuccessful.
Any help, really appreciated.

gr8Adakron on 4 May 2019

@gr8Adakron Did you find any solutions to this issue? I'm having similar issue with using multiple models in serving and dealing with high latency issue with parallel calls. I don't understand the concept of --per_process_gpu_memory_fraction; I mean as per the description it says; If 0.0, Tensorflow will automatically select a value. So, I don't get what exact value I've to use here.

spate141 on 4 May 2019

Unfortunately no one has any intuition as to why this is happening. We're trying to replicate this internally but haven't been successful in doing so.

So please either provide traces of the server, provide the models+requests and let us know how you're 1) pegging models to GPU cards and 2) coordinating the requests through the pipeline so we can try to reproduce and help.

misterpeddy on 14 Jun 2019

@spate141 We haven't heard from you for more than 7 days. Without the information asked by @unclepeddy, we can't proceed. We will close this issue for now. If you still need help, please return to this case and reopen it with the information to investigate further.

chanshah on 21 Jun 2019

are there any tf serving optimizations you can do to reach high speed within 1-30ms?

geraldstanje on 20 Dec 2019

You can try the Tensor RT inference server from NVIDIA. It's more efficient and scalable and 3x faster than TF-Serving.

gr8Adakron on 7 Jan 2020

👍2

@gr8Adakron
I have a grpc client with 100 connections and 2 threads per connection to Tensorflow serving.
I currently run TensorFlow Serving with cpu only (with 6 cores), the p99 latency is around 20ms (no batching).
lets say I switch from cpu to cpu, will it help to increase the throughput and still keep p99 latency at 20ms?