Serving: Deploying a single model with high # of requests.

Created on 8 Aug 2018 · 3 Comments · Source: tensorflow/serving

I'm a bit confused about how TF Serving handles a specific case.

The situation: I have one EC2 instance (no GPU) and many inference requests coming in. With TF Serving, should I load the model _once_ and have every client send requests to that single copy? My worry is that under heavy traffic the one model will often be busy, so responses will slow down.

Or, should I load the same model _multiple_ times so that the load can be balanced?

Help would be awesome :).

performance

farzaa

👍1

Most helpful comment

@gautamvasudevan dude, this is also a platform. If you build a product, then you'd better help its customers.

gr8Adakron on 4 Oct 2019

👍3

All 3 comments

One solution is to write a Flask/Gunicorn server script that spawns multiple workers, each of which communicates with the TensorFlow model through the serving API. This way, you can handle multiple requests with the same TF model!

spate141 on 9 Aug 2018

👍1
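
A minimal sketch of the approach spate141 describes, under some loud assumptions: TF Serving's REST API is reachable at `localhost:8501`, the model is registered under the hypothetical name `my_model`, and a plain WSGI callable stands in for Flask so the example needs only the standard library. The helper names (`build_predict_request`, `predict`, `application`) are illustrative, not part of any library.

```python
import json
import urllib.request

# Assumption: TF Serving's REST endpoint on localhost:8501, serving a model
# registered under the hypothetical name "my_model".
TF_SERVING_URL = "http://localhost:8501/v1/models/my_model:predict"


def build_predict_request(instances, url=TF_SERVING_URL):
    """Build a POST request in TF Serving's REST 'instances' (row) format."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(instances, url=TF_SERVING_URL, timeout=5.0):
    """Send instances to TF Serving and return its 'predictions' list."""
    request = build_predict_request(instances, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["predictions"]


def application(environ, start_response, predict_fn=predict):
    """WSGI entry point. Every Gunicorn worker runs this callable and
    forwards its requests to the same, single TF Serving process."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        instances = json.loads(environ["wsgi.input"].read(length))["instances"]
    except (ValueError, KeyError, TypeError):
        status = "400 Bad Request"
        result = {"error": "expected a JSON body with an 'instances' list"}
    else:
        try:
            status = "200 OK"
            result = {"predictions": predict_fn(instances)}
        except OSError as exc:  # urllib's URLError/HTTPError are OSError subclasses
            status = "502 Bad Gateway"
            result = {"error": str(exc)}
    body = json.dumps(result).encode("utf-8")
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

If this were saved as `app.py` (a hypothetical filename), something like `gunicorn --workers 4 app:application` would start four worker processes in front of the one model server. The model itself is still loaded only once, inside TF Serving; the workers just relay requests to it.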

Please go to Stack Overflow for help and support:

https://stackoverflow.com/questions/tagged/tensorflow-serving

If you open a GitHub issue, it must be a bug, a feature request, or a
significant problem with documentation (for small docs fixes please send
a PR instead).

Thanks!