Serving: Best practice: loading new models during runtime (not versions)

Created on 26 Apr 2017 · 19 Comments · Source: tensorflow/serving

I'm curious to see and discuss the best options for deploying new models during runtime. From the tutorials and examples I learned how new versions of a model are loaded and how to load multiple models (at server start). Unfortunately, I could not find any functionality for loading new models during runtime. (I'm aware that parts of this were discussed in issue #380.) My first question is whether any kind of dynamic loading of new models during runtime already exists. (I think technically all the components should be there, but I could not find a 'trigger' that starts the process of loading the model.)

In general I can think of several ways to serve multiple models during runtime:

1) Use a separate server for each model. A small manager module could start a new server, or even a Docker container, for each model. (Question: how do we best forward the prediction request to the right server?)
2) Similar to watching for multiple versions on the file system, we could watch the parent directory for a new folder containing a new model.
3) If we use the 'custom_model_config', we could periodically check it for changes and see whether a new model has been added (or one has been removed).
4) Have a gRPC call that sends the required information for the new model.

It would be great if some of the developers could share their opinion on how such a case should be handled, or if others could share best practices for deploying new models.

contributions welcome

markusnagel

👍8

All 19 comments

Probably the best would be to extend model_servers/main.cc to accept new ModelServerConfigs (perhaps via an rpc call), and call ServerCore::ReloadConfig().

chrisolston on 1 May 2017

I am also working on dynamically loading new models. Is this -- i.e. using RPC to ReloadConfig -- still the recommended route for adding new models?

If so, one question: when we call ReloadConfig, is there a downtime when the models get reloaded or are the models replaced only when new models are loaded and ready to serve?

rajhans on 16 Oct 2017

Correct, that is still the recommended approach.

During ReloadConfig() any already-loaded models will remain loaded and can take traffic the whole time.

chrisolston on 16 Oct 2017

👍7

I have managed to get it working by adding a Reload endpoint to PredictionServiceImpl in main.cc. Would you be interested in getting these changes as a pull request?

rajhans on 18 Oct 2017

@rajhans - That sounds good, but can you please define a separate service for this new RPC endpoint, rather than adding it to PredictionService?

Also, it would be good for the new config to be passed in-line in the rpc request, rather than having the rpc pass a filename for the config.

Lastly, there's the issue of RPC timeouts. ServerCore::ReloadConfig() can block for a long time in general, since some models take a long time to load. I don't think we want the RPC request to hang for so long. Instead, should we make it asynchronous: the RPC validates the config and then kicks off ReloadConfig() and returns immediately, letting the ReloadConfig() work proceed in the background. WDYT?
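
To make the asynchronous option concrete, here is a minimal Python sketch of the pattern: validate the config synchronously, start the slow reload on a background thread, and return right away. It is not TensorFlow Serving code. `AsyncReloader`, `validate_config`, and the `reload_fn` callback are hypothetical names, and the dict of model name to base path is only a stand-in for a real ModelServerConfig.

```python
import threading
from typing import Callable, Dict, List, Optional


def validate_config(config: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the config looks valid."""
    problems = []
    for name, base_path in config.items():
        if not name:
            problems.append("model name must not be empty")
        if not base_path.startswith("/"):
            problems.append(f"{name!r}: base_path must be absolute")
    return problems


class AsyncReloader:
    """Hypothetical handler: validate now, reload in the background."""

    def __init__(self, reload_fn: Callable[[Dict[str, str]], None]):
        self._reload_fn = reload_fn    # stands in for the slow, blocking reload
        self._lock = threading.Lock()  # serializes reloads so they never interleave
        self._thread: Optional[threading.Thread] = None
        self.last_error: Optional[Exception] = None

    def handle_request(self, config: Dict[str, str]) -> str:
        problems = validate_config(config)
        if problems:
            return "REJECTED: " + "; ".join(problems)
        # Copy the config so later changes by the caller cannot affect the reload.
        self._thread = threading.Thread(
            target=self._run, args=(dict(config),), daemon=True
        )
        self._thread.start()
        return "ACCEPTED"  # returned before the reload has finished

    def _run(self, config: Dict[str, str]) -> None:
        with self._lock:
            try:
                self._reload_fn(config)
                self.last_error = None
            except Exception as exc:  # remember the failure for a later status query
                self.last_error = exc

    def wait(self, timeout: Optional[float] = None) -> None:
        """Block until the most recent background reload ends (or timeout)."""
        if self._thread is not None:
            self._thread.join(timeout)
```

The trade-off is visible in the sketch: the caller gets "ACCEPTED" quickly, but has to learn about completion or failure some other way, for example by querying something like `last_error` later.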

chrisolston on 19 Oct 2017

Jumping into this thread, as I ended up implementing something similar (another service that accepts a gRPC call with a ModelConfig proto, rebuilds the config, and then calls ReloadConfig).

@chrisolston - I think when you refresh the config you actually want to know that the models were loaded, as the immediate action afterwards is usually to forward calls. If you don't know that the reload has finished, you'd have to poll until the new model is loaded (and what happens if reloading the config failed?).
Another small problem I faced is that https://github.com/tensorflow/serving/blob/master/tensorflow_serving/core/servable_state_monitor.cc#L190 does not accept a timeout. I think ideally you'd configure a timeout (maybe a longer one) for the ReloadConfig call, and possibly roll back afterwards.
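
For illustration, polling with a deadline could look like the following Python sketch. It is generic and not tied to any TensorFlow Serving API: `wait_until` is a hypothetical helper, and the `predicate` argument stands in for whatever check reports that the new model is ready.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 0.1,
) -> bool:
    """Poll `predicate` until it returns True or `timeout_s` seconds elapse.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval_s, remaining))
```

A caller would treat a False result as the signal to report the failure or roll back to the previous config.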

nvgoldin on 20 Oct 2017

If folks prefer the RPC to be synchronous and potentially block for a long time (could be tens of minutes for very large models) I'm fine with that. I agree it would be easier to use, and also much easier to implement, but it's unusual to have such huge RPC timeouts and some environments may not handle that case well.

SGTM to add an (optional) timeout to WaitUntilServablesReachState().

chrisolston on 20 Oct 2017

There is also an interesting comment in a pull request: https://github.com/tensorflow/serving/pull/294

In general, I think it is possible to make the changes, even if my C++ skills are below grasshopper level.

With the changes we can add new models during runtime, but does anyone see a way to remove models during runtime? From the code, it seems to me that the server will always keep all previously loaded models.

joergkiesewetter on 27 Oct 2017

The semantics of ServerCore::ReloadConfig() is that the new config supersedes the old one. So if the old config has models {A, B} and the new one has only {B}, model A will get unloaded.
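
As a rough illustration of these "new config supersedes old" semantics, the effect on the set of model names can be modeled with plain set arithmetic. This is a Python sketch, not the ServerCore implementation, and `plan_reload` is a hypothetical name.

```python
from typing import Dict, Set


def plan_reload(old: Set[str], new: Set[str]) -> Dict[str, Set[str]]:
    """Classify model names when a new config replaces the old one."""
    return {
        "load": new - old,    # only in the new config: gets loaded
        "keep": old & new,    # in both configs: stays loaded
        "unload": old - new,  # dropped from the config: gets unloaded
    }


plan = plan_reload({"A", "B"}, {"B"})
# plan["unload"] is {"A"}, plan["keep"] is {"B"}, plan["load"] is empty
```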

chrisolston on 27 Oct 2017

Thanks for your comments Chris. To summarize, how does the following PR sound to you:
1) New service with a ReloadConfig method.
2) Synchronous RPC call to ReloadConfig. Going with synchronous for now as it is easier to understand.

Would you also like the timeout added in the same PR, or is it OK to submit the above first?

rajhans on 30 Oct 2017

It would be great if the new service could keep track of the loaded models itself, so that we have an add endpoint and a remove endpoint. That way, we would just need to add new configs and would not need to track the loaded models externally.

joergkiesewetter on 30 Oct 2017

@rajhans - Your plan sounds good to me. Please add the timeout to ReloadConfig() in a separate PR (which would go first).

@joergkiesewetter - That's a good idea. I propose getting the basic service with ReloadConfig in and working, and then considering the delta (add/delete model) API as a possible follow-on PR.

chrisolston on 30 Oct 2017

Updating this thread with the progress I have so far:
1) I have managed to implement and test an "AdminService" that accepts ReloadConfig RPC calls.
2) I did not have the cycles until very recently to look into adding a timeout to WaitUntilServablesReachState. I took a stab at adding the timeout; however, before I move forward, I would love some early comments on the code so far. Here is PR #718. Does this look right?
Also, I'd like suggestions on testing this.

rajhans on 8 Jan 2018

Hi Guys,

I see this PR is about managing model loading via an API call to the server; that would be great to have!

Currently, though, is there a way to configure the server to run more than one version of the same model at the same time?
Is there a way to phase out older model versions that are no longer being used, or to limit how many of the latest versions are active? Can that be configured somehow?

My current config that works, looks something like this:
model_config_list: {
  config: {
    name: "my_model1",
    base_path: "/data/tf_models/tf_model_1",
    model_platform: "tensorflow"
  },
  config: {
    name: "my_model2",
    base_path: "/data/tf_models/tf_model_2",
    model_platform: "tensorflow"
  }
}

vitalyli on 5 Feb 2018

@vitalyli, yes, there is a way, but it's poorly documented. You can start by looking at this code:
https://github.com/tensorflow/serving/tree/master/tensorflow_serving/config/model_server_config.proto
And also this code:
https://github.com/tensorflow/serving/tree/master/tensorflow_serving/sources/storage_path/file_system_storage_path_source.proto

Then you have to declare a model_version_policy in your config.

Your config might end up looking like this:

model_config_list: {
  config: {
    name: "mymodel",
    base_path: "/some/filesystem/path",
    model_platform: "tensorflow",
    model_version_policy: {
       specific: {
        versions: 101,
        versions: 202
       }
    }
  }
}

Or, if you want to retire old versions, you can use this to keep only the latest N versions:

model_config_list: {
  config: {
    name: "mymodel",
    base_path: "/some/filesystem/path",
    model_platform: "tensorflow",
    model_version_policy: {
       latest: {
        num_versions: N
       }
    }
  }
}
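
To illustrate what the two policies select, here is a simplified Python model. It is a sketch under assumptions, not the server's actual logic: `versions_to_serve` is a hypothetical helper, it assumes that "latest" means the highest version numbers found on disk, and it assumes that one latest version is served when no policy is given.

```python
from typing import Iterable, List, Optional


def versions_to_serve(
    available: Iterable[int],
    latest_n: Optional[int] = None,
    specific: Optional[Iterable[int]] = None,
) -> List[int]:
    """Pick the versions to serve from those available on disk."""
    on_disk = sorted(set(available))
    if specific is not None:
        # "specific" policy: only the listed versions that actually exist.
        wanted = set(specific)
        return [v for v in on_disk if v in wanted]
    # "latest" policy: the N highest version numbers (assumed default: 1).
    n = 1 if latest_n is None else latest_n
    return on_disk[-n:] if n > 0 else []


versions_to_serve([101, 202, 303], specific=[101, 202])  # [101, 202]
versions_to_serve([101, 202, 303], latest_n=2)           # [202, 303]
```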

quantumlicht on 5 Feb 2018

👍5 ❤1

FYI I believe this thread is closely related with https://github.com/tensorflow/serving/issues/537

chrisolston on 26 Feb 2018

Hi,
I have a question regarding loading/unloading triggers.
Let's say I may potentially have to use 10 models, but only 3 would fit in GPU memory.
How can we easily unload the "idle" models and keep only the 3 most recently used?

What is the current TF Serving behavior when we specify N models in the config file and the N models do not fit in GPU RAM?

Thanks.

vince62s on 3 Mar 2018

Hi @chrisolston, is the second item (delta api) in this comment implemented? https://github.com/tensorflow/serving/issues/422#issuecomment-340527225

yuzheng21 on 14 Sep 2018

@yuzheng21 No, we wound up concluding that it isn't a good idea because (1) it relies on using a Model Server binary's in-memory state (not persisted anywhere) as the reference for the delta, and (2) the query-and-update pattern is race-prone. Instead, the recommended approach is to store the configuration somewhere reliable (e.g. a database or a redundant file system), and use that to convert deltas into new configs to send to the Model Server.
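
A minimal Python sketch of that pattern, assuming the stored configuration is a simple mapping from model name to base path: a delta is applied to the stored copy, and the complete config is then rendered in the text format used earlier in this thread. `apply_delta` and `render_model_config_list` are hypothetical helpers; persisting the mapping and sending the config to the Model Server are left out.

```python
from typing import Dict, Iterable, Optional


def apply_delta(
    stored: Dict[str, str],
    add: Optional[Dict[str, str]] = None,
    remove: Iterable[str] = (),
) -> Dict[str, str]:
    """Return a new full config (name -> base_path) with the delta applied."""
    updated = dict(stored)  # never mutate the stored reference copy in place
    for name in remove:
        updated.pop(name, None)
    updated.update(add or {})
    return updated


def render_model_config_list(models: Dict[str, str]) -> str:
    """Render the full config as model_config_list text."""
    lines = ["model_config_list: {"]
    for name in sorted(models):
        lines += [
            "  config: {",
            f'    name: "{name}"',
            f'    base_path: "{models[name]}"',
            '    model_platform: "tensorflow"',
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines)


stored = {"A": "/models/A", "B": "/models/B"}
new_config = apply_delta(stored, add={"C": "/models/C"}, remove=["A"])
print(render_model_config_list(new_config))  # full config containing B and C
```

Because the server always receives the complete config, the stored copy remains the single source of truth, and a restarted server can simply be sent the same config again.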

chrisolston on 15 Sep 2018

👍1
