Serving: Deploying a single model with a high number of requests

Created on 8 Aug 2018 · 3 Comments · Source: tensorflow/serving

I'm a bit confused about how TF Serving handles a specific case.

The situation: I have one EC2 instance (no GPU) and many inference requests coming in. With TF Serving, should I load the model _once_ and have everyone send requests to that same model? My concern is that under heavy traffic, a single loaded model may often be busy and slow down.

Or, should I load the same model _multiple_ times so that the load can be balanced?

Help would be awesome :).

performance

All 3 comments

One solution is to write a Flask/Gunicorn server script that spawns multiple workers, each of which communicates with the TensorFlow model through the Serving API. This way, you can handle many concurrent requests with the same TF model!
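Here is a rough sketch of that idea, not a tested production setup. To keep it dependency-free it uses a plain WSGI callable instead of Flask (Gunicorn can serve either), and it forwards each request to TF Serving's REST predict endpoint. The host/port `localhost:8501`, the model name `my_model`, and the helper names `build_predict_request` and `predict` are assumptions made up for illustration; adjust them to your deployment.

```python
import json
import urllib.request

# Assumptions (not from the thread): TF Serving's REST endpoint is reachable
# at localhost:8501 and the model is served under the name "my_model".
TF_SERVING_URL = "http://localhost:8501/v1/models/my_model:predict"


def build_predict_request(instances, url=TF_SERVING_URL):
    """Build the HTTP POST request for TF Serving's REST predict API."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def predict(instances, url=TF_SERVING_URL, timeout=5.0):
    """Send one inference request to TF Serving and return the parsed JSON reply."""
    request = build_predict_request(instances, url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def app(environ, start_response):
    """WSGI entry point: every Gunicorn worker runs this and forwards to TF Serving."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
        reply = predict(payload["instances"])
        status = "200 OK"
    except (KeyError, TypeError, ValueError) as exc:
        # Malformed JSON or a missing "instances" field in the client request.
        status, reply = "400 Bad Request", {"error": str(exc)}
    except OSError as exc:
        # urllib's URLError/HTTPError are OSError subclasses: TF Serving
        # was unreachable or returned an error status.
        status, reply = "502 Bad Gateway", {"error": str(exc)}
    data = json.dumps(reply).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(data)))])
    return [data]
```

If this were saved as `proxy.py` (a hypothetical file name), something like `gunicorn --workers 4 proxy:app` would start four worker processes. Note that the workers only forward requests; the model itself stays loaded once inside TF Serving.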

Please go to Stack Overflow for help and support:

https://stackoverflow.com/questions/tagged/tensorflow-serving

If you open a GitHub issue, it must be a bug, a feature request, or a
significant problem with documentation (for small docs fixes please send
a PR instead).

Thanks!

@gautamvasudevan Dude, this is also a platform. If you build a product, you had better help its customers.
