Serving: Serving models from S3 causes high costs due to excessive ListBucket operations

Created on 26 Mar 2019 · 12 Comments · Source: tensorflow/serving

Bug Report

System information

  • TensorFlow Serving installed from (source or binary): official Docker image tensorflow/serving
  • TensorFlow Serving version: 1.12.0

Describe the problem

We have deployed a couple of models where both the model config file and the models themselves are stored in S3. In our case, the usage report shows about 150 MB/h for the ListBucket operation on the bucket in which the models and configuration are stored.
This adds up to a cost of about 1,000 USD per month.

Exact Steps to Reproduce

  • deploy model and configuration to S3
  • run tensorflow-serving like this: tensorflow_model_server --model_config_file=s3://modelbucket/model.yaml
  • after a couple of hours, download AWS usage report from here: https://console.aws.amazon.com/billing/home#/reports/usage
  • Useful Filters:

    • Services: Amazon Simple Storage Service

    • Operation: ListBucket

    • Report Granularity: Hours

Expected behavior

I expect that this operation, if at all necessary, is done only once: At startup.
If this is done to verify the existence of a file, the same can be achieved with a much cheaper HEAD request. Please also see how a similar issue was fixed in boto: https://github.com/boto/boto/issues/2078

contributions welcome bug

All 12 comments

i suspect this is due to filesystem polling to detect and load new versions of servables as they become available. you can disable the polling or reduce its frequency by setting the --file_system_poll_wait_seconds flag on the modelserver. the default is 1 second (afair).

happy to review patches to improve/optimize this path.

I'll set it to -1 and will report back tomorrow if we see any change in the amount of these API calls.

Like I already mentioned, ListBucket should be a last-resort action. Checking for file modification or existence can be achieved with HEAD requests and is >10x cheaper. Sadly, I can't suggest a code modification: this portion of the code comes from tensorflow itself, so I don't have a good overview of what other code paths we would break by modifying it for the tensorflow-serving use case.

After passing --file_system_poll_wait_seconds=2147483647 (the maximum int32 value; when setting it to -1 the server doesn't start up completely), the number of ListBucket calls went down.
I'd consider this a workaround, as the default setting leads to unexpected behavior.
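
The workaround above can be sketched as a small launcher helper. This is only an illustration: the helper name `build_model_server_command` is hypothetical, while the two flags and the S3 path are the ones quoted in this thread.

```python
import shlex

# Largest signed 32-bit integer, the value used in the workaround above.
INT32_MAX = 2**31 - 1


def build_model_server_command(model_config_file, poll_wait_seconds=INT32_MAX):
    """Return the argv list for tensorflow_model_server with a custom poll interval.

    Hypothetical helper for illustration; it only assembles the command line.
    """
    return [
        "tensorflow_model_server",
        f"--model_config_file={model_config_file}",
        f"--file_system_poll_wait_seconds={poll_wait_seconds}",
    ]


command = build_model_server_command("s3://modelbucket/model.yaml")
# Print the command as a single shell-safe string.
print(shlex.join(command))
```

With such a large interval the server effectively polls once at startup, so new model versions would presumably not be picked up without a restart.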

@joekohlsdorf Thanks for the workaround.
Closing this issue for now and will reopen if there are any contributions posted for the same.

Thanks @joekohlsdorf. Can you please open a feature request, or rename this issue (as an FR) and reopen it, so we can keep track of it (and hopefully someone can help make the relevant changes)?

Thanks!

Reopening in order to track this issue.

I don't agree that this is a feature request. The wrong method is used for checking whether a file has changed, and it is causing 12.5x the expected cost. In my case, that is over USD 1,000 per month!
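
For context, the 12.5x figure matches the ratio between S3 request price tiers. The sketch below is a rough estimate only: the per-request prices are assumptions (S3 Standard list prices of 0.005 USD per 1,000 LIST requests and 0.0004 USD per 1,000 GET/HEAD requests) and should be checked against the current AWS price list. The helper `monthly_request_cost` is hypothetical.

```python
# Assumed prices in USD per 1,000 requests; not taken from this thread.
LIST_PRICE_PER_1000 = 0.005
HEAD_PRICE_PER_1000 = 0.0004

# A 30-day month, in seconds.
SECONDS_PER_MONTH = 30 * 24 * 3600


def monthly_request_cost(requests_per_second, price_per_1000):
    """Monthly cost in USD of a constant request rate at the given price."""
    requests = requests_per_second * SECONDS_PER_MONTH
    return requests / 1000 * price_per_1000


price_ratio = LIST_PRICE_PER_1000 / HEAD_PRICE_PER_1000
list_cost = monthly_request_cost(1, LIST_PRICE_PER_1000)
head_cost = monthly_request_cost(1, HEAD_PRICE_PER_1000)

print(f"LIST/HEAD price ratio: {price_ratio:.1f}")
print(f"One LIST per second for a month: {list_cost:.2f} USD")
print(f"One HEAD per second for a month: {head_cost:.2f} USD")
```

Under these assumptions a single request per second is inexpensive on its own, so a bill of this size would imply many LIST requests per polling cycle or many polled paths and servers.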

This is standard behaviour and the result is completely unexpected.

Closing this issue as per the discussion in #1295

Why was this issue closed? These are two separate problems.
This issue is about Tensorflow Serving using the wrong S3 API call to check if a file has changed.
The other issue is about not being able to disable automatic model reloading.

@unclepeddy Please correct me if I misunderstood your comments in #1295

@hgadig no worries, my mistake for asking you to close this without an explanation.

@joekohlsdorf sorry about the expense you had to bear but I don't think there is any way to avoid this - please feel free to correct my understanding as I don't have much experience with S3 features.

First, and just to be clear, we poll the model directory provided to ModelServer in order to discover new model versions (not to look for "file modification or existence").

Second, you mention using HEAD instead of LIST - this is in fact what we do when we only care about existence (starting here, calling into the core TF S3 implementation to ensure the parent model directory exists before proceeding to check its children).
However, we then need to list the child directories of model_dir to find all the model versions that are to be aspired here, which calls into the core TF S3 implementation, which in turn calls ListObjects. I'm assuming that's where the ListBucket operations are happening (not sure why the API name and what's reported on your usage report don't match). My point is, I don't think there's a way to avoid listing the child directories, as that is how we find the versions to be aspired. This is well documented here.

Third, TFS cannot be using "the wrong S3 API" since it doesn't differentiate between different FS environments - it relies on the TF core/platform/env library for this. Therefore, given that we have to list directory children and we don't own the S3 client implementation, I unfortunately cannot think of a fix for this costly behavior other than changing the polling frequency.

Let me know if you disagree!
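
The version discovery described in the comment above can be sketched in a few lines. This is a simplified Python stand-in, not the actual TensorFlow Serving code: it uses a local temporary directory in place of the S3 model directory, and the function names `discover_versions` and `latest_version` are hypothetical.

```python
import os
import tempfile


def discover_versions(model_dir):
    """List the children of model_dir and return numeric version directories, sorted."""
    versions = []
    # On S3 this directory listing is the step that would turn into a LIST request.
    for child in os.listdir(model_dir):
        if child.isdigit() and os.path.isdir(os.path.join(model_dir, child)):
            versions.append(int(child))
    return sorted(versions)


def latest_version(model_dir):
    """Return the highest version number found, or None if there is none."""
    versions = discover_versions(model_dir)
    return versions[-1] if versions else None


# Demo: a temporary local directory stands in for the model base path.
with tempfile.TemporaryDirectory() as base:
    for name in ("1", "2", "10", "notes"):
        os.mkdir(os.path.join(base, name))
    found = discover_versions(base)
    newest = latest_version(base)

print(found, newest)
```

Because the set of version directories is not known in advance, a per-file existence check alone cannot replace the listing step; every polling cycle repeats the listing.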

@joekohlsdorf came across this post as I was weighing a few different deployment strategies myself. Was wondering if you still had issues with this. And also, if a hybrid s3/non-s3 solution would work.

The hybrid solution I was thinking about would host the saved_models in S3 but keep the model.config file on the server itself. That way the server could look for config updates every second (or whatever you want), but would only make S3 requests when it needs the model.

I haven't implemented this, just a thought. So if there are blockers/disadvantages to this feel free to list those.
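
A minimal sketch of the hybrid layout proposed above, untested against a real server: a local config file whose base_path points at S3. The model name `my_model`, the S3 path, the file name `models.config`, and the helper `render_model_config` are all hypothetical; the config uses the text-format model_config_list layout that TensorFlow Serving reads.

```python
import os
import tempfile


def render_model_config(name, base_path):
    """Render a single-model config in protobuf text format."""
    return (
        "model_config_list {\n"
        "  config {\n"
        f'    name: "{name}"\n'
        f'    base_path: "{base_path}"\n'
        '    model_platform: "tensorflow"\n'
        "  }\n"
        "}\n"
    )


# Hypothetical model name and S3 location.
config_text = render_model_config("my_model", "s3://modelbucket/my_model")

# Write the config to local disk instead of S3.
config_path = os.path.join(tempfile.mkdtemp(), "models.config")
with open(config_path, "w") as handle:
    handle.write(config_text)

print(config_text)
```

Going by the maintainer's description earlier in the thread, the server polls the model directory itself to discover new versions, so moving only the config file to local disk may not reduce the LIST calls against the S3 base path.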
