Serving: Latency high after loading a new model.

Created on 30 Mar 2017 · 8 Comments · Source: tensorflow/serving

I'm using TensorFlow Serving to load a wide-and-deep model as an online prediction service, and the model is updated every 10 minutes. We found that the first few requests have high latency right after a new model is loaded. Is this a known issue, or are there any suggestions for tracking down the problem?

performance

xiaop1987

Most helpful comment

Hi, I have the exact same problem. However, I do not understand how one can add initialization or dummy "warm-up requests" to the init op (I used Keras for training and the SavedModelBuilder for exporting the model). Can you please explain it in more detail, e.g. with a code example?

Thanks!

eldonaldo on 5 May 2018

👍4

All 8 comments

Some TensorFlow graphs perform lazy initialization, making the first request (or first few requests) to a newly loaded model slow. The best way to handle that is to add initialization or dummy "warm-up requests" to the init op, which tf-serving calls while loading the model.

chrisolston on 30 Mar 2017

👍3

@chrisolston Thank you very much for the explanation and suggestion; the problem is clear to me now.
Here are some suggestions for how tf-serving loads models.
a) tf-serving could add a warm-up option:
1. Store a request for each model when the first request arrives.
2. When a new version of the model is loaded, do not mark it ready until it has been warmed up with the stored request.

b) Add lazy model loading:
1. We may start hundreds of tf-serving processes, and they all start loading and updating to the new model version at almost the same time. This can make the cluster's network and disks quite busy (the model is stored on HDFS) and make the cluster unstable.
2. So we could load/update the model at a random time within a specified period to smooth out the network and disk load.

xiaop1987 on 3 Apr 2017
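
Suggestion (a) above can be sketched in a few lines of plain Python. This is only an illustration of the idea (remember one request, replay it against a new version before making that version live); it is not TensorFlow Serving code, and the class name `WarmedUpModelSlot` and its methods are hypothetical. A "model" here is assumed to be any callable that takes a request.

```python
from typing import Any, Callable, Optional


class WarmedUpModelSlot:
    """Holds the live model; a new version goes live only after a warm-up call."""

    def __init__(self) -> None:
        self._live: Optional[Callable[[Any], Any]] = None
        self._sample_request: Any = None
        self._has_sample = False

    def predict(self, request: Any) -> Any:
        if self._live is None:
            raise RuntimeError("no model version is ready yet")
        # Remember the first request seen so later versions can be warmed up.
        if not self._has_sample:
            self._sample_request = request
            self._has_sample = True
        return self._live(request)

    def load_new_version(self, model: Callable[[Any], Any]) -> None:
        # Replay the stored request against the new version before it goes
        # live, so any lazy initialization cost is paid here, not by a client.
        if self._has_sample:
            model(self._sample_request)  # result is discarded
        self._live = model
```

The very first version has no stored request, so it goes live without a warm-up call; every later version is exercised once before it starts serving traffic.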

For (a), the recommended approach is to do it within the tf graph, triggered from tf-serving calling the init op during load.

For (b), interesting idea. I would expect various I/O queues to smooth it out anyway but maybe you are hitting timeouts? You could write a custom SourceAdapter that acts as the identity function but adds a random delay -- that would do the trick. Feel free to contribute the SourceAdapter via a PR.

chrisolston on 3 Apr 2017
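
The "identity function plus a random delay" idea for (b) can be illustrated outside of TensorFlow Serving. A real SourceAdapter would be written in C++ against the tf-serving interfaces; the Python below is only a sketch of the behaviour, and the names `jittered_delay` and `load_with_jitter` are hypothetical. It assumes each process picks a uniformly random delay within a configured window before loading.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def jittered_delay(window_seconds: float,
                   rng: Optional[random.Random] = None) -> float:
    """Pick a uniformly random delay between 0 and window_seconds."""
    if window_seconds < 0:
        raise ValueError("window_seconds must be non-negative")
    if rng is None:
        rng = random.Random()
    return rng.uniform(0.0, window_seconds)


def load_with_jitter(load_fn: Callable[[], T],
                     window_seconds: float,
                     sleep: Callable[[float], None] = time.sleep,
                     rng: Optional[random.Random] = None) -> T:
    """Wait a random time within the window, then load unchanged (identity)."""
    sleep(jittered_delay(window_seconds, rng))
    return load_fn()
```

With hundreds of processes and, say, a 60-second window, the loads from shared storage are spread across that minute instead of starting together. The trade-off is that each process serves the old version for up to `window_seconds` longer.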

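Hi, I have the exact same problem. However, I do not understand how one can add initialization or dummy "warm-up requests" to the init op (I used Keras for training and the SavedModelBuilder for exporting the model). Can you please explain it in more detail, e.g. with a code example?

Thanks!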

eldonaldo on 5 May 2018

👍4

Ping @chrisolston

eldonaldo on 24 May 2018

same problem

ydp on 28 May 2018

Hi @chrisolston, I have the same problem. Can you provide an example of how to call the init op during load?

weberxie on 28 May 2018

Hi @chrisolston, the current version of tf-serving tries to load warm-up requests from the tf_serving_warmup_requests file. Does TensorFlow provide a common API for exporting requests to that location, or should we write the requests there manually?
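
For reference, writing the file manually might look like the sketch below. It assumes the `tensorflow` and `tensorflow-serving-api` Python packages are installed, that the model is served through the Predict API, and that the file belongs in the `assets.extra` subdirectory of the exported version directory. The helper names `warmup_file_path` and `write_warmup_file`, and every model name, signature name, and input key passed to them, are hypothetical.

```python
import os


def warmup_file_path(version_dir: str) -> str:
    """Return the warm-up file location inside an exported version directory."""
    return os.path.join(version_dir, "assets.extra", "tf_serving_warmup_requests")


def write_warmup_file(version_dir, model_name, signature_name, inputs):
    """Write one PredictRequest, wrapped in a PredictionLog, as a TFRecord.

    `inputs` maps signature input names to sample values (assumed shapes).
    """
    # Imported lazily so the path helper works without TensorFlow installed.
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_log_pb2

    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = signature_name
    for key, value in inputs.items():
        request.inputs[key].CopyFrom(tf.make_tensor_proto(value))

    log = prediction_log_pb2.PredictionLog(
        predict_log=prediction_log_pb2.PredictLog(request=request))

    path = warmup_file_path(version_dir)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with tf.io.TFRecordWriter(path) as writer:
        writer.write(log.SerializeToString())
    return path
```

A call could look like `write_warmup_file("/models/wide_deep/42", "wide_deep", "serving_default", {"x": [[1.0, 2.0]]})`, where all four arguments are placeholders. A model exported with a classification or regression signature would likely need the corresponding request and log types instead of `PredictRequest` and `PredictLog`.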