I'm a bit confused about how TF Serving handles a specific case.
The situation: I have one EC2 instance (no GPU) and many inference requests coming in. With TF Serving, should I load the model _once_ and have everyone send requests to that same model? My concern is that when lots of requests come in, the single model may often be busy and slow down under the load.
Or, should I load the same model _multiple_ times so that the load can be balanced?
Help would be awesome :).
One solution is to write a Flask/Gunicorn server script that spawns multiple workers, each of which talks to the TensorFlow model through the serving API. That way, you can handle many requests with the same TF model!
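A minimal sketch of that idea, under some assumptions: TF Serving is running on the same machine with its REST API on port 8501, and the model is served under the hypothetical name `my_model`. To keep the example dependency-free it uses a plain WSGI callable instead of Flask; Gunicorn can serve either. The names `build_predict_request`, `forward_to_tf_serving`, and `app` are made up for illustration.

```python
import json
import urllib.request

# Assumed values: TF Serving's REST API on port 8501, serving a model
# under the hypothetical name "my_model".
TF_SERVING_URL = "http://localhost:8501/v1/models/my_model:predict"


def build_predict_request(instances, url=TF_SERVING_URL):
    """Build the POST request for TF Serving's REST predict endpoint."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def forward_to_tf_serving(instances):
    """Send the instances to the one TF Serving process and return its JSON reply."""
    request = build_predict_request(instances)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())


def app(environ, start_response, forward=forward_to_tf_serving):
    """WSGI entry point; every Gunicorn worker runs its own copy of this callable.

    `forward` is injectable so the handler can be exercised without a live server.
    """
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
    result = forward(payload.get("instances", []))
    body = json.dumps(result).encode("utf-8")
    start_response(
        "200 OK",
        [("Content-Type", "application/json"), ("Content-Length", str(len(body)))],
    )
    return [body]
```

If this were saved as `proxy.py` (a hypothetical file name), something like `gunicorn --workers 4 proxy:app` would start four worker processes that all forward to the same loaded model, so the model itself is only loaded once by TF Serving.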
Please go to Stack Overflow for help and support:
https://stackoverflow.com/questions/tagged/tensorflow-serving
If you open a GitHub issue, it must be a bug, a feature request, or a
significant problem with documentation (for small docs fixes please send
a PR instead).
Thanks!
@gautamvasudevan Dude, this is also a platform. If you build a product, you had better help its customers.