I'm curious to see and discuss the best options for deploying new models during runtime. From the tutorials and examples I learned how new versions of a model are loaded and how to load multiple models at server start. Unfortunately, I could not find a way to load new models during runtime. (I'm aware that parts of this were discussed in issue #380.) My first question: does any kind of dynamic loading of new models during runtime already exist? (Technically all the components seem to be there, but I could not find a 'trigger' that starts the process of loading a model.)
In general I can think of several ways to serve multiple models during runtime:
It would be great if some of the developers could post their opinion on how such a case should be handled or people share their opinion on best practices related to deploying new models.
Probably the best would be to extend model_servers/main.cc to accept new ModelServerConfigs (perhaps via an rpc call), and call ServerCore::ReloadConfig().
I am also working on dynamically loading new models. Is this -- i.e. using RPC to ReloadConfig -- still the recommended route for adding new models?
If so, one question: when we call ReloadConfig, is there a downtime when the models get reloaded or are the models replaced only when new models are loaded and ready to serve?
Correct, that is still the recommended approach.
During ReloadConfig() any already-loaded models will remain loaded and can take traffic the whole time.
I have managed to get it working by adding a Reload endpoint to PredictionServiceImpl in main.cc. Would you be interested in getting these changes as a pull request?
@rajhans - That sounds good, but can you please define a separate service for this new RPC endpoint, rather than adding it to PredictionService?
Also, it would be good for the new config to be passed in-line in the rpc request, rather than having the rpc pass a filename for the config.
Lastly, there's the issue of RPC timeouts. ServerCore::ReloadConfig() can block for a long time in general, since some models take a long time to load. I don't think we want the RPC request to hang for so long. Instead, should we make it asynchronous: the RPC validates the config and then kicks off ReloadConfig() and returns immediately, letting the ReloadConfig() work proceed in the background. WDYT?
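A minimal sketch of the asynchronous option described above, in Python purely for illustration (the real server is C++). `AsyncReloader`, `validate`, and `reload_fn` are hypothetical names, not TensorFlow Serving APIs; `reload_fn` stands in for the blocking ServerCore::ReloadConfig() call, and the validation rule is an assumption.

```python
import threading


class AsyncReloader:
    """Validates a config, then runs a blocking reload in the background."""

    def __init__(self, reload_fn):
        self._reload_fn = reload_fn  # hypothetical stand-in for the blocking reload
        self._lock = threading.Lock()
        self._thread = None
        self.status = "idle"  # idle | loading | ok | failed
        self.error = None

    @staticmethod
    def validate(config):
        # Assumed rule for illustration: every model needs a name and a base path.
        for model in config:
            if not model.get("name") or not model.get("base_path"):
                raise ValueError(f"invalid model entry: {model!r}")

    def request_reload(self, config):
        """Returns immediately; the reload itself proceeds in a worker thread."""
        self.validate(config)
        with self._lock:
            if self.status == "loading":
                raise RuntimeError("a reload is already in progress")
            self.status = "loading"
            self.error = None
            self._thread = threading.Thread(
                target=self._run, args=(config,), daemon=True)
            self._thread.start()

    def _run(self, config):
        try:
            self._reload_fn(config)
            outcome, error = "ok", None
        except Exception as exc:  # keep the failure so a status query can report it
            outcome, error = "failed", exc
        with self._lock:
            self.status, self.error = outcome, error

    def wait(self, timeout=None):
        """Blocks until the background reload finishes or the timeout expires."""
        thread = self._thread
        if thread is not None:
            thread.join(timeout)
        return self.status
```

With this shape the caller gets a fast accept/reject answer, but it still needs some way to query `status` afterwards to learn whether the new models actually loaded.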
Jumping into this thread as I ended up implementing something similar (another service which accepts a gRPC call with a ModelConfig proto, rebuilds the config, then calls ReloadConfig).
@chrisolston - I think when you refresh the config you actually want to know that the models were loaded, as the immediate action afterwards is usually to forward calls. If you don't know the reload has ended, you'd have to poll until the new model is loaded (and what happens if reloading the config failed?).
Another small problem I faced is that https://github.com/tensorflow/serving/blob/master/tensorflow_serving/core/servable_state_monitor.cc#L190 does not accept a timeout. I think ideally you'd configure a timeout (maybe a longer one) for the ReloadConfig call, and possibly roll back afterwards.
If folks prefer the RPC to be synchronous and potentially block for a long time (could be tens of minutes for very large models) I'm fine with that. I agree it would be easier to use, and also much easier to implement, but it's unusual to have such huge RPC timeouts and some environments may not handle that case well.
SGTM to add an (optional) timeout to WaitUntilServablesReachState().
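To illustrate what an optional timeout on such a wait could look like, here is a small Python sketch built on a condition variable. `StateMonitor` and its methods are hypothetical and only mimic the idea; the real ServableStateMonitor is C++ and also handles error states, which this sketch ignores.

```python
import threading


class StateMonitor:
    """Tracks one state string per servable and lets callers wait on it."""

    def __init__(self):
        self._cond = threading.Condition()
        self._states = {}

    def set_state(self, name, state):
        with self._cond:
            self._states[name] = state
            self._cond.notify_all()  # wake every waiter so it can re-check

    def wait_until_reach(self, names, goal, timeout=None):
        """Returns True if all `names` reach `goal` within `timeout` seconds.

        timeout=None waits indefinitely, which matches a wait with no timeout.
        """
        def done():
            return all(self._states.get(n) == goal for n in names)

        with self._cond:
            return self._cond.wait_for(done, timeout=timeout)
```

Returning a boolean instead of raising leaves the caller free to decide what a timeout means, for example reporting an error or attempting a rollback.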
There is also an interesting comment in a pull request: https://github.com/tensorflow/serving/pull/294
In general, I think it is possible to make the changes, even if my C++ skills are below grasshopper level.
With the changes we can add new models during runtime, but does anyone see a way to remove models during runtime? In the code it seems to me that the server will always keep all previously loaded models.
The semantics of ServerCore::ReloadConfig() is that the new config supersedes the old one. So if the old config has models {A, B} and the new one has only {B}, model A will get unloaded.
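The superseding behaviour can be pictured as a set difference between the old and new model names. This is only an illustrative Python sketch; `plan_reload` is a hypothetical helper, not how ServerCore is implemented.

```python
def plan_reload(old_models, new_models):
    """Splits a reload into models to unload, to load, and to keep serving."""
    old, new = set(old_models), set(new_models)
    return {
        "unload": sorted(old - new),  # present before, absent from the new config
        "load": sorted(new - old),    # newly listed in the new config
        "keep": sorted(old & new),    # listed in both, so they stay loaded
    }


plan = plan_reload({"A", "B"}, {"B"})
print(plan)  # {'unload': ['A'], 'load': [], 'keep': ['B']}
```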
Thanks for your comments Chris. To summarize, how does the following PR sound to you:
1) New service with a ReloadConfig method.
2) Synchronous RPC call to ReloadConfig. Going with synchronous for now as it is easier to understand.
Would you also like the timeout added in the same PR, or is it OK to submit the above as is?
It would be great if the new service could handle the loaded models itself, so that we have an add endpoint and a remove endpoint. That way we would just need to add new configs and would not need to track the loaded models externally.
@rajhans - Your plan sounds good to me. Please add the timeout to ReloadConfig() in a separate PR (which would go first).
@joergkiesewetter - That's a good idea. I propose getting the basic service with ReloadConfig in and working, and then considering the delta (add/delete model) API as a possible follow-on PR.
Updating this thread with the progress I have so far:
1) I have managed to implement and test an "AdminService" that accepts ReloadConfig RPC calls.
2) I did not have the cycles until very recently to look into adding a timeout to WaitUntilServablesReachState. I took a stab at it; before I move forward, I would love some early comments on the code so far. Here is PR #718. Does this look right?
Also, I'd like suggestions on testing this.
Hi Guys,
I see this PR is about managing model loading via an API call into the server; that would be great to have!
Currently, though, is there a way to configure running more than one version of the same model at the same time?
Is there a way to phase out older model versions that are no longer being used, or to limit how many of the latest versions are active? Can that be configured somehow?
My current working config looks something like this:
model_config_list: {
  config: {
    name: "my_model1",
    base_path: "/data/tf_models/tf_model_1",
    model_platform: "tensorflow"
  },
  config: {
    name: "my_model2",
    base_path: "/data/tf_models/tf_model_2",
    model_platform: "tensorflow"
  }
}
@vitalyli, yes, there is a way, but it's poorly documented. You can start by looking at this code:
https://github.com/tensorflow/serving/tree/master/tensorflow_serving/config/model_server_config.proto
And also this code:
https://github.com/tensorflow/serving/tree/master/tensorflow_serving/sources/storage_path/file_system_storage_path_source.proto
Then you have to declare a model_version_policy in your config.
Your config might end up looking like this:
model_config_list: {
  config: {
    name: "mymodel",
    base_path: "/some/filesystem/path",
    model_platform: "tensorflow",
    model_version_policy: {
      specific: {
        versions: 101,
        versions: 202
      }
    }
  }
}
Or, if you want to retire old versions, you can use this to keep only the latest N versions:
model_config_list: {
  config: {
    name: "mymodel",
    base_path: "/some/filesystem/path",
    model_platform: "tensorflow",
    model_version_policy: {
      latest: {
        num_versions: N
      }
    }
  }
}
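As a rough illustration of what the two policies above select, here is a Python sketch that picks versions from the set found under a model's base path. `select_versions` is a hypothetical helper written for this thread, not part of TensorFlow Serving, and it assumes versions are plain integers.

```python
def select_versions(available, latest=None, specific=None):
    """Picks which version numbers would be served under one policy.

    Pass exactly one of `latest` (an int N) or `specific` (a list of versions).
    """
    if (latest is None) == (specific is None):
        raise ValueError("pass exactly one of latest= or specific=")
    versions = sorted(set(available))
    if specific is not None:
        wanted = set(specific)
        # Versions that are requested but not present on disk are skipped.
        return [v for v in versions if v in wanted]
    if latest < 1:
        raise ValueError("latest must be at least 1")
    return versions[-latest:]  # the N highest version numbers


print(select_versions([99, 101, 150, 202], latest=2))             # [150, 202]
print(select_versions([99, 101, 150, 202], specific=[101, 202]))  # [101, 202]
```

Under the `latest` policy, older versions fall out of the selection as soon as newer ones appear, which is the "phase out" behaviour asked about earlier in the thread.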
FYI I believe this thread is closely related with https://github.com/tensorflow/serving/issues/537
Hi,
I have a question regarding loading/unloading triggers.
Let's say I may potentially have to use 10 models, but only 3 would fit in GPU memory.
How can we easily unload the "idle" models and keep only the last 3 actually used?
What is the current TF Serving behavior when we specify N models in the config file and the N models do not fit in GPU RAM?
Thanks.
Hi @chrisolston, is the second item (delta api) in this comment implemented? https://github.com/tensorflow/serving/issues/422#issuecomment-340527225
@yuzheng21 No, we wound up concluding that it isn't a good idea because (1) it relies on using a Model Server binary's in-memory state (not persisted anywhere) as the reference for the delta, and (2) the query-and-update pattern is race-prone. Instead, the recommended approach is to store the configuration somewhere reliable (e.g. a database or redundant file system), and use that to convert deltas into new configs to send to the Model Server.
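A small Python sketch of that pattern: the persisted model map is the source of truth, a delta is applied to it, and the complete config is rendered for the Model Server. `apply_delta` and `render_config` are hypothetical helpers; a plain dict stands in for the reliable store, and names and paths are assumed to contain no quote characters.

```python
def apply_delta(stored, add=(), remove=()):
    """Returns a new full model map; `stored` itself is left unmodified."""
    models = dict(stored)
    for name in remove:
        models.pop(name, None)  # removing an unknown model is a no-op
    for name, base_path in add:
        models[name] = base_path
    return models


def render_config(models, platform="tensorflow"):
    """Renders the full model set as a model_config_list in text format."""
    lines = ["model_config_list: {"]
    for name in sorted(models):
        lines += [
            "  config: {",
            f'    name: "{name}"',
            f'    base_path: "{models[name]}"',
            f'    model_platform: "{platform}"',
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines)


stored = {
    "my_model1": "/data/tf_models/tf_model_1",
    "my_model2": "/data/tf_models/tf_model_2",
}
updated = apply_delta(
    stored,
    add=[("my_model3", "/data/tf_models/tf_model_3")],
    remove=["my_model1"],
)
print(render_config(updated))
```

In a real deployment the read, the update, and the write back to the store would need to happen atomically (for example in a database transaction), since that is where the race mentioned above is actually avoided.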