Serving: Time out error when serving models due to update from 1.8 to 1.9

Created on 21 Jul 2018 · 7 Comments · Source: tensorflow/serving

First of all, thx for your work in TF serving!

We recently updated both tensorflow and tensorflow serving from 1.8.0 to 1.9.0.

Now we have one model that can no longer be served: requesting
predictions via the REST API results in a "Timed out waiting for notification" error.

The results of my debugging suggest a problem on the serving side.

It is not a problem of the SavedModel, as I can get predictions from the same saved_model via:

from tensorflow.contrib import predictor

predict_fn = predictor.from_saved_model(saved_model_dir)
predictions = predict_fn({'input': ...})
...

For the record, version combinations I have tried (always using the same code):

  • train models with TF 1.8, serve with TF 1.8: all models work
  • train models with TF 1.9, serve with TF 1.9: one model fails
  • train models with TF 1.9, serve with TF 1.8: at least one model fails because it uses (internally, user code unchanged!) TF 1.9 functionality that is not available in serving TF 1.8.

Note: We use the server installed via apt-get, but I observed the same behavior with the official
TF serving docker images.

All 7 comments

See #996: we had a recent issue with both the 1.8 and 1.9 binaries timing out, and we (re)released the binaries and Docker images yesterday. I'm not sure whether you are using the old or the newly (re)released 1.9 binaries.

Can you please verify that this still happens with the latest binaries:

Download the DEB package from here (and install it, without using apt-get):
https://storage.googleapis.com/tensorflow-serving-apt/pool/tensorflow-model-server-1.9.0/t/tensorflow-model-server/tensorflow-model-server_1.9.0_all.deb

Or (re)pull and use the tensorflow/serving:latest image from https://hub.docker.com/r/tensorflow/serving/

If it still happens, the server logs, complete request and response details, and model details (size, type such as regression or CNN) would help us find the root cause.

Thx for your quick reply @netfs! I missed the linked issue in my search.
I was indeed using the old (from before EOD July 19) binaries/images. Both of your proposed solutions work!

Hi there @netfs, I'm still getting the "Timed out waiting for notification" error when serving with the REST API. I have tried seven different images from tensorflow/serving, but none of them seem to work. I can receive other error messages just fine, though.
This is my request script:

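import base64
import json

import requests

# `imageFile` (a file opened in binary mode) and `url` are defined elsewhere.
image_data = base64.b64encode(imageFile.read())
b64_string = image_data.decode("utf-8")

values = {
    "signature_name": "predict_post",
    "instances": [
        {"b64": b64_string}
    ],
}
json_data = json.dumps(values)
json_response = requests.post(url, data=json_data, stream=True)

# Read the raw response body as bytes.
response = json_response.raw.read()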
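For anyone reproducing this, the request script can be split into two small helpers that are testable without a running server. This is only a sketch: the function names `build_predict_payload` and `extract_error` are made up for illustration, and the signature name is taken from the script in this comment.

```python
import base64
import json


def build_predict_payload(image_bytes, signature_name="predict_post"):
    """Return the JSON body for a REST predict call carrying one base64 image."""
    b64_string = base64.b64encode(image_bytes).decode("utf-8")
    return json.dumps({
        "signature_name": signature_name,
        "instances": [{"b64": b64_string}],
    })


def extract_error(response_text):
    """Return the server's error message, or None if the body has no "error" key."""
    try:
        body = json.loads(response_text)
    except ValueError:
        # The body was not valid JSON, so there is no structured error to report.
        return None
    return body.get("error") if isinstance(body, dict) else None
```

The payload could then be sent with `requests.post(url, data=build_predict_payload(image_bytes), timeout=60)`. Note that the `timeout` argument there is a client-side limit only and does not change how long the server waits.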

If it helps, this is the log after I run the docker command for serving:

sudo docker run -p 8501:8501   --mount type=bind,source=$(pwd)/models/horus/2,target=/models/horus/2 -e MODEL_NAME=horus -t tensorflow/serving:1.10.1
2018-09-13 06:42:22.594423: I tensorflow_serving/model_servers/main.cc:157] Building single TensorFlow model file config:  model_name: horus model_base_path: /models/horus
2018-09-13 06:42:22.594728: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
2018-09-13 06:42:22.594760: I tensorflow_serving/model_servers/server_core.cc:517]  (Re-)adding model: horus
2018-09-13 06:42:22.695308: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: horus version: 2}
2018-09-13 06:42:22.695381: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: horus version: 2}
2018-09-13 06:42:22.695399: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: horus version: 2}
2018-09-13 06:42:22.695422: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:360] Attempting to load native SavedModelBundle in bundle-shim from: /models/horus/2
2018-09-13 06:42:22.695472: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /models/horus/2
2018-09-13 06:42:22.707615: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2018-09-13 06:42:22.761963: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:113] Restoring SavedModel bundle.
2018-09-13 06:42:24.474513: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:148] Running LegacyInitOp on SavedModel bundle.
2018-09-13 06:42:24.500590: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:233] SavedModel load for tags { serve }; Status: success. Took 1805105 microseconds.
2018-09-13 06:42:24.500653: I tensorflow_serving/servables/tensorflow/saved_model_warmup.cc:83] No warmup data file found at /models/horus/2/assets.extra/tf_serving_warmup_requests
2018-09-13 06:42:24.500778: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: horus version: 2}
2018-09-13 06:42:24.504172: I tensorflow_serving/model_servers/main.cc:327] Running ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2018-09-13 06:42:24.505593: I tensorflow_serving/model_servers/main.cc:337] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 235] RAW: Entering the event loop ...

The model that I'm running is an object detection Faster R-CNN model that accepts images as input and outputs classes and probabilities. The model size is 500+ MB. I tried to find the Docker container logs but failed.

Any help is much appreciated. Thank you.

@imkhan1104 I fixed the same problem by passing in the flag --rest_api_timeout_in_ms when I run the model server

The same here using runtimeVersion 1.12. It only happens from time to time (≈5% of total API calls).

Prediction failed: { "error": "Timed out waiting for notification" }
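If the error only shows up on a small fraction of calls, a client-side retry may be enough to paper over it while the root cause is investigated. The following is a minimal sketch, not something proposed in this thread: `call_with_retry` and `send_request` are hypothetical names, and `send_request` is assumed to be a zero-argument callable that returns the parsed JSON response as a dict.

```python
import time

TIMEOUT_MESSAGE = "Timed out waiting for notification"


def call_with_retry(send_request, max_attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Call send_request() again while the response carries the timeout error.

    Returns the first response without the timeout error, or the last
    response if every attempt timed out.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        body = send_request()
        if body.get("error") != TIMEOUT_MESSAGE:
            return body
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay_s, then 2x, 4x, ...
            sleep(base_delay_s * 2 ** attempt)
    return body
```

Retrying only makes sense when the prediction call is safe to repeat, and it hides the timeout rather than fixing it.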

Nice and clear, but how do I set that flag if I'm using Docker? @varunarora @edumotya

@Adblu I had to do a custom Docker image build using their existing Dockerfile. Wasn't fun :/ But it isn't the worst thing in the world.
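A possible alternative to rebuilding the image is to override the container entrypoint and pass the model server flags directly. This is an untested sketch under assumptions: it assumes the `tensorflow_model_server` binary is on the PATH inside the image (not verified for every tag), and the helper name `docker_serving_command` is made up for illustration. The model name, port, and image tag are taken from the docker command earlier in this thread.

```python
import shlex


def docker_serving_command(model_name, host_model_dir, timeout_ms,
                           image="tensorflow/serving:1.10.1", rest_port=8501):
    """Build a `docker run` argv that starts the model server with a REST timeout."""
    container_base = "/models/" + model_name
    return [
        "docker", "run",
        "-p", f"{rest_port}:{rest_port}",
        "--mount", f"type=bind,source={host_model_dir},target={container_base}",
        # Bypass the image's default entrypoint so flags can be passed explicitly.
        "--entrypoint", "tensorflow_model_server",
        image,
        f"--rest_api_port={rest_port}",
        f"--model_name={model_name}",
        f"--model_base_path={container_base}",
        f"--rest_api_timeout_in_ms={timeout_ms}",
    ]


def to_shell(argv):
    """Render an argv list as a single shell-quoted command line."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

For example, `print(to_shell(docker_serving_command("horus", "/path/to/models/horus", 60000)))` prints a command line that can be pasted into a shell; the host path here is a placeholder.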
