Serving: Latency issue while doing parallel calls to TF-serving

Created on 21 Mar 2019  路  16Comments  路  Source: tensorflow/serving

My TF-serving pipeline runs SIX TF-serving image models of different architecture _(like xception, densenet, nasnet, UNET etc)_ in parallel.

My time difference for 10 different images running through the pipeline in sequential order:


- Time for batch size of 1 took: 3.687 sec.
- Time for batch size of 1 took: 3.054 sec.
- Time for batch size of 1 took: 3.270 sec.
- Time for batch size of 1 took: 13.055 sec.
- Time for batch size of 1 took: 2.041 sec.
- Time for batch size of 1 took: 1.421 sec.
- Time for batch size of 1 took: 2.325 sec.
- Time for batch size of 1 took: 1.521 sec.
- Time for batch size of 1 took: 12.388 sec.
- Time for batch size of 1 took: 1.781 sec.

I don't understand why is it not following the constant graph of 1-2 sec in all the 10-requests, its returning prediction for 3-4 request very efficiently(i.e within 1-2sec).
Then suddenly for forthcoming 2-3 requests, it is peaking the latency to ~10sec(which is very bad for my application).

I am using enable_warmup_flag as to start up the TF-serving. Here is my command to start the TF-serving server:

tensorflow_model_server --port=8001 --model_config_file=/serving_models/config_files/product_type.conf --enable_batching=true --enable_model_warmup=true --per_process_gpu_memory_fraction=0.9

Relevant Information:

  • TensorFlow ModelServer: 1.10.0-dev
  • TensorFlow Library: 1.10.0

Python 3.6.6

  • tensorboard==1.9.0
  • tensorflow==1.9.0
  • tensorflow-gpu==1.10.1
  • tensorflow-serving-api==1.13.0
  • tensorflow-tensorboard==1.5.1
  • grpc-google-iam-v1==0.11.4
  • grpcio==1.13.0
  • celery==4.2.1

CUDA

  • libcudnn: 7.0.5
  • CUDA: 9.0
  • NCCL: 2.1.15
  • GCS GPUs: K80 (12GB RAM)

OS:

  • Distributor ID: Ubuntu
  • Description: Ubuntu 16.04.6 LTS
  • Release: 16.04
  • Codename: xenial

EXTRA:

  • On individual GPU of 12 GB RAM, I am loading ~25 TF-serving frozen graph model of size ~100MB.
  • Using CELERY with DJANGO to manage the parallelization of the calls in the pipeline.
  • IDEALLY, it should return prediction for all the 10 sequential requests within 10sec.

Any help would be really appreciated.

bug

Most helpful comment

You can try the Tensor RT inference server from NVIDIA. It's more efficient and scalable and 3x faster than TF-Serving.

All 16 comments

@netfs @ewilderj @montanaflynn @lamberta @kchodorow please help!

Sorry,I want to ask some questions about how can I make the server to process parallel requests?Do I need run several images in the docker or just one image is ok ? I cant find anything about this in the doc of TF-sering.Any help is appreciate.Thx!

@SefaZeng thanks for responding.

30 concurrent request (i.e 30 different images), where each request must be holding an individual image.
Something like this
github-answer2

So this is whole architecture, now the real problem is:

  • For image-1 it takes around 3-5secs
  • For image-2 it takes the same
  • but for image-3 and image-4 it will go around ~10 sec

Ideally, it should return me the outputs of all the concurrent 30-requests (i.e 150 requests = 30 images X 5 calls ) under ~10sec. But in my case, it goes above ~30-40 sec (at collective acknowledgment).

In short, 150 parallel calls (30 images X 5 request) to TF-serving must be treated in the same as it treats the single request, I guess.

As TF-serving has the capacity to handle the parallel requests and my warmup request flag is also set. On the contrary, I am making calls to the same sets of models, so loading and unloading of the model-graph from GPU RAM or migrating model load between CPU-GPU RAM doesn't seem to be feasible.

I am not sure what the real problem with the throttling of latency.
Any help truly appreciated. Thank you!

same issue, waiting for help.

@netfs @ewilderj @montanaflynn @lamberta @kchodorow

Are you sure you've set up warmup correctly? Please understand model warmup is not magic

  • it can't produce out of thin air the requests that TFS needs to use to warm up your model. You need to provide it by including the PredictionLog data you want replayed back to the server in assets.extra.

If you haven't done this, here is a guide on how to do it. If you have, please let me know so that I can take a closer look at it.

Thanks

@unclepeddy
hey, so I have added the TFrecords in the extra.assets still not sure where to use that script?
Finished reading warmup data for model at /home/serving/pattern_resnet/serving/1/assets.extra/tf_serving_warmup_requests . Number of warmup records read: 150.
Please, just a little guidance, the documentation doesn't specify much detail.

Nope, as the logs indicate, you've done it correctly. The script is for producing the predictionlogs which you already seem to have...
So now you're warming up model 1 correctly. Do this for all models to ensure they're all warmed up.

Thanks alot.

@ewilderj @kchodorow @lamberta @jart @unclepeddy
Still, the output is not consistent, if I am calling(inferencing with TF-serving) 8 models parallelly for one single image then this is the 5-iteration time sequence:

Iteration-1 for image_1.jpg
```
- xception ............ [0.063sec]
- densenet ............ [0.095sec]
- mobilenet ............ [0.045sec]
- mxnet152 ............ [0.090sec]
- mxnet152places ............ [8.918sec]
- segmentation_1........... [0.146sec]
- segmentation_2............ [0.147sec]
- human_classification ............ [8.826sec]

Iteration-2 for image_2.jpg
 ```
        - xception ............ [0.080sec]
         - densenet ............ [0.102sec]
         - mobilenet ............ [0.026sec]
         - mxnet152 ............ [0.150sec]
         - mxnet152places ............ [0.174sec]
         - segmentation_1............ [0.055sec]
         - segmentation_2............ [0.172sec]
         - human_classification ............ [0.056sec]

Iteration-3 for image_3.jpg
```
- xception ............ [8.933sec]
- densenet ............ [0.058sec]
- mobilenet ............ [8.866sec]
- mxnet152 ............ [0.146sec]
- mxnet152places ............ [0.169sec]
- segmentation_1............ [0.062sec]
- segmentation_2............ [0.172sec]
- human_classification ............ [0.056sec]

Iteration-4 for image_4.jpg
 ```
       - xception ............ [0.057sec]
         - densenet ............ [0.116sec]
         - mobilenet ............ [0.097sec]
         - mxnet152 ............ [0.161sec]
         - mxnet152places ............ [0.089sec]
         - segmentation_1............ [0.050sec]
         - segmentation_2............ [0.164sec]
         - human_classification ............ [8.632sec]

Iteration-5 for image_5.jpg
```
- xception ............ [8.789sec]
- densenet ............ [0.112sec]
- mobilenet ............ [0.108sec]
- mxnet152 ............ [8.826sec]
- mxnet152places ............ [0.095sec]
- segmentation_1............ [0.059sec]
- segmentation_2............ [8.883sec]
- human_classification ............ [0.056sec]

```

The best iteration was the iteration-2, which is the ideal case, but its so rare, somehow in all the iteration one or other model goes beyond 8sec which defeats the whole purpose of utilizing TF-serving in deployment.

I have 4 K80-GPUs and all these 8-models have been distributed across all the 4-GPUs, I mean:

  • xception, densenet and mobilenet are been hosted on GPU number 1 using TF-serving.
  • mxnet152, mxnet152places are been hosted on GPU number 2 using TF-serving.
  • human_classification is been hosted on GPU number 3 using TF-serving.
  • segmentation_1 and segmentation_2 are been hosted on GPU number 4 using TF-serving.

These models aren't even huge, all of them are around the size of ~100-150 MB.
My system configuration and process I have already mentioned above comments.

Trying to figure the optimization using TF-serving for months, unsuccessful.
Any help, really appreciated.

@gr8Adakron Did you find any solutions to this issue? I'm having similar issue with using multiple models in serving and dealing with high latency issue with parallel calls. I don't understand the concept of --per_process_gpu_memory_fraction; I mean as per the description it says; If 0.0, Tensorflow will automatically select a value. So, I don't get what exact value I've to use here.

Unfortunately no one has any intuition as to why this is happening. We're trying to replicate this internally but haven't been successful in doing so.

So please either provide traces of the server, provide the models+requests and let us know how you're 1) pegging models to GPU cards and 2) coordinating the requests through the pipeline so we can try to reproduce and help.

@spate141 We haven't heard from you for more than 7 days. Without the information asked by @unclepeddy, we can't proceed. We will close this issue for now. If you still need help, please return to this case and reopen it with the information to investigate further.

are there any tf serving optimizations you can do to reach high speed within 1-30ms?

You can try the Tensor RT inference server from NVIDIA. It's more efficient and scalable and 3x faster than TF-Serving.

@gr8Adakron
I have a grpc client with 100 connections and 2 threads per connection to Tensorflow serving.
I currently run TensorFlow Serving with cpu only (with 6 cores), the p99 latency is around 20ms (no batching).
lets say I switch from cpu to cpu, will it help to increase the throughput and still keep p99 latency at 20ms?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

cchung100m picture cchung100m  路  4Comments

OmriShiv picture OmriShiv  路  3Comments

dylanrandle picture dylanrandle  路  3Comments

rsk-07 picture rsk-07  路  3Comments

atwj picture atwj  路  4Comments