We use deep learning to process very large images for manufacturing. Images come at a very high rate so we load-balance appropriately. Because of our performance requirements, we tweak everything to squeeze every last millisecond we can out of our deep learning inference.
We have been using a different machine learning framework but are now researching TensorFlow. We want a machine with multiple GPU cards that serves inferences on input data. The inferences are made by a model with high GPU RAM requirements, and they are consumed by multiple machines on the same LAN.
The input data is high volume and needs to be processed in real time, so the load has to be balanced among the GPU cards. The machines that consume these inferences are themselves load-balanced, and more of them are added as the load increases. Ideally, you would add GPU cards to the single machine until it reaches its limit and then add another GPU machine.
How do you load-balance TensorFlow inferences in such a setup?
Any updates on this? Incredibly practical problem statement.
Please go to Stack Overflow for help and support:
https://stackoverflow.com/questions/tagged/tensorflow-serving
If you open a GitHub issue, it must be a bug, a feature request, or a significant problem with documentation (for small docs fixes please send a PR instead).
Thanks!
@gautamvasudevan Issue has been open for even longer at https://stackoverflow.com/questions/48104939/load-balancing-for-real-time-production-environments-in-tensorflow but still no answer from your team.
An external load balancer program should be enough; there is no need to implement this in tensorflow-serving.
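To make the external load balancer idea concrete, here is a minimal sketch of least-outstanding-requests dispatch in Python. It assumes, and this is only an assumption about the deployment, that one model server process runs per GPU card (for example pinned with `CUDA_VISIBLE_DEVICES`), each listening on its own port. The class name `LeastLoadedBalancer`, its methods, and the `gpu-host-N:port` addresses are hypothetical names for illustration, not part of TensorFlow or tensorflow-serving.

```python
import threading
from contextlib import contextmanager


class LeastLoadedBalancer:
    """Pick the backend with the fewest in-flight requests (hypothetical helper)."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._lock = threading.Lock()
        # Maps backend address -> number of requests currently in flight.
        self._in_flight = {backend: 0 for backend in backends}

    def add_backend(self, backend):
        """Register a new GPU card or GPU machine while running."""
        with self._lock:
            self._in_flight.setdefault(backend, 0)

    @contextmanager
    def acquire(self):
        """Reserve the least-loaded backend for the duration of one request."""
        with self._lock:
            # Ties go to the backend that was registered first.
            backend = min(self._in_flight, key=self._in_flight.get)
            self._in_flight[backend] += 1
        try:
            yield backend
        finally:
            with self._lock:
                self._in_flight[backend] -= 1

    def snapshot(self):
        """Return a copy of the current in-flight counts."""
        with self._lock:
            return dict(self._in_flight)


# Assumed layout: two model servers on one machine, one per GPU card.
balancer = LeastLoadedBalancer(["gpu-host-1:8500", "gpu-host-1:8501"])

with balancer.acquire() as backend:
    # Send the inference request to `backend` here (client code omitted).
    pass

# When the first machine is full, a second GPU machine can be added.
balancer.add_backend("gpu-host-2:8500")
```

Least-outstanding-requests tends to suit inference better than plain round-robin when request latency varies with image size, because a card that is stuck on a large image stops receiving new work. In production an off-the-shelf proxy would normally play this role; the sketch only shows the dispatch logic.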