I'm trying to load a model from AWS S3 with a model config file. Therefore I have config file which looks like this
model_config_list: {
config: {
name: "test1",
base_path: "s3://my-dev-models/5/",
model_platform: "tensorflow"
}
}
the credentials- and the config-file is placed in ~/.aws/. When I start the server everything seems to be fine, but I get some strange error messages from the console. During the download of the model, I get the error messages:
2018-03-01 14:34:37.248216: I external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-03-01 14:34:37.571128: I external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-03-01 14:34:37.592931: E external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-03-01 14:34:37.593090: W external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-03-01 14:34:37.593440: I external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-03-01 14:34:37.620189: I external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-03-01 14:34:37.667426: I external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-03-01 14:34:38.079607: I external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-03-01 14:34:38.492179: I external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-03-01 14:34:38.571143: I external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-03-01 14:34:38.590718: E external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-03-01 14:34:38.590847: W external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-03-01 14:34:38.591141: I external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-03-01 14:34:38.622214: I external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-03-01 14:34:38.905253: I external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
These error messages come in every second, as long as the download is running. When the download is finshed, the model gets loaded and I can send request to it, but the error messages are still popping up. But this time, the messages with Connection has been released. Continuing. only comes three times per second.
When the download is finshed, there is another specific error block:
2018-03-01 15:21:35.954905: W external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc:57] Encountered Unknown AWSError
InvalidRange
The requested range is not satisfiable:
2018-03-01 15:21:36.082429: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
2018-03-01 15:21:36.346559: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:161] Restoring SavedModel bundle.
2018-03-01 15:21:36.387250: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:171] The specified SavedModel has no variables; no checkpoints were restored.
2018-03-01 15:21:36.387343: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:196] Running LegacyInitOp on SavedModel bundle.
2018-03-01 15:21:36.434209: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:291] SavedModel load for tags { serve }; Status: success. Took 150765459 microseconds.
The message seems like there is a connection problem with AWS. It seems like something is requesting something but receives a 404 error. But the strange thing is, that the model get's downloaded. I can also use the AWS CLI to query the bucket and see the model file. I'm using the latest commit f0b3a41 to serve my models.
I thought, probably there is a bug in tensorflow itself, which is already fixed, so I also tried to set the dependent commit to the latest 6fdb9ad1baf7686a75f9e660178f7ac595e7fc2e. But it didn't helped.
Does anyone has a clue were the problem could be and how to get rid of these error messages?
I've faced the same problem. After investigating I found the reason, when Tensorflow s3 client checks if directory exists it firstly run Stat on it since s3 have no possibility to check whether directory exists . Then it checks if key with such name exists and fails with such error messages. So we see errors in log. For me I did workaround setting loglevel from s3 client to FATAL in this line ignoring Tensorflow log level.
I suppose there is no right way to avoid this issue because it is intended behavior if you need logs.
Was this closed without resolution? I am hitting this error now. Is it a problem with the logging? or with the aws policy?
Closing this seems premature indeed. I'm also running into these messages.
Maybe we can override AWS s3 client logging level...
I started investigating @notimesea 's suggestion to patch the log level but put it aside when I realised the TFS build pulls in an archive for the main tensorflow codebase. Is there an easy way to point the build to a locally patched copy of tensorflow?
I did it a bit dirty but it worked.
# tensorflow commmit hash
export TF_REVISION="d2e24b6039433bd83478da8c8c2d6c58034be607"
# tensorflow_serving commmit hash
export TF_SERVING_REVISION="79d3354cd8a408fb981670512d661c2b7e89f5bf"
git clone --recursive https://github.com/tensorflow/serving && \
cd serving && \
git checkout $TF_SERVING_REVISION && \
bazel build -c opt --copt='-march=native' --copt='-O3' --action_env TF_REVISION=$TF_REVISION tensorflow_serving/model_servers:tensorflow_model_server && \
# disable logs from s3 in tensorflow
sed -i 's/tensorflow::internal::MinLogLevelFromEnv()/FATAL/' bazel-serving/external/org_tensorflow/tensorflow/core/platform/s3/aws_logging.cc && \
bazel build -c opt --copt='-march=native' --copt='-O3' --action_env TF_REVISION=$TF_REVISION tensorflow_serving/model_servers:tensorflow_model_server
@notimesea thanks! I'll incorporate that into our build so I can stop embedding models into docker images...
Hi! facing this issue now. Did this move forward?
Sorry we found that the missing S3 permissions fixed the problem: "s3:DeleteObject"
Closing as this is resolved
I really don't see how this is resolved. Can you elaborate on your fix @elgalu? I've tried running this with an admin service account with no luck.
Can this be fixed in upsteam? I just set TF_CPP_MIN_LOG_LEVEL=3 as workaround.
I dont think this is resolved at all. pulling the latest tfs still encounters this error with an admin key
After looking in the code again, I can see that this is not really a bug. It is the intended behavior.
The reason for the requests is, that the serving container looks for a new version to load once a second. This should be the feature to autoload new model version once they are available.
The corresponding code is around sources/storage_path/file_system_storage_path_source.cc::360.
Unfortunately, nothing helps to prevent the serving container from polling for new versions. We have tried to call the polling method out of the PeriodicFunction, but then no model was loaded. We have also tried to set the servableVersionPolicy of the model to 'specific', but the serving container still polls for new versions.
The only thing what not really fixed the problem but decreased the number of requests was to increase the polling interval itself to i.e. 1h. This can be done in model_servers/main.cc::main().file_system_poll_wait_seconds or with a flag on startup with the same name.
One downside of this 'solution' is, that it could take up to one hour to load a new model. New Versions and models are only loaded when the PeriodicFunction runs. When it just runs once an hour, only in this interval new models are loaded.
Sure, I understand that it is logging the aws failure due to looking for a new model version: however spamming a Warning and Errors on expected behaviour is not really desired, so I do think this is a bug, its just an un-handled case.
I hate to bring up an issue and not offer a viable solution... but realistically we should be suppressing those logs and pushing some debug logs specifying that a new version could not be found; rather than allowing the aws logger to propagate an intended Error log.
more than that - i have problem changing the code and compiling like suggested above with periodic function. i would expect this to be TFS configuration.
does someone know if there is such feature on the roadmap ?
make sure you set undocumented AWS_LOG_LEVEL env var or else the TF s3 plugin will spam your logs, eg: export AWS_LOG_LEVEL=3 for FATAL
make sure you set undocumented
AWS_LOG_LEVELenv var or else the TF s3 plugin will spam your logs, eg:export AWS_LOG_LEVEL=3forFATAL
This would only hide the real problem in my opinion - which is querying AWS every second.
this needs to be changed.
Used this command docker run -p 9000:8500 -e AWS_ACCESS_KEY_ID=xxx -e AWS_SECRET_ACCESS_KEY=xxx -e MODEL_BASE_PATH='s3://aml-models/tf-serving/models' -e MODEL_NAME=model_name S3_ENDPOINT=s3.us-east-2.amazonaws.com -e AWS_REGION=us-east-2 -e TF_CPP_MIN_LOG_LEVEL=3 -t tensorflow/serving:1.11.0
Error If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
[warn] getaddrinfo: address family for nodename not supported
[evhttp_server.cc : 235] RAW: Entering the event loop ...
Has anybody solved it ??
Used this command
docker run -p 9000:8500 -e AWS_ACCESS_KEY_ID=xxx -e AWS_SECRET_ACCESS_KEY=xxx -e MODEL_BASE_PATH='s3://aml-models/tf-serving/models' -e MODEL_NAME=model_name S3_ENDPOINT=s3.us-east-2.amazonaws.com -e AWS_REGION=us-east-2 -e TF_CPP_MIN_LOG_LEVEL=3 -t tensorflow/serving:1.11.0Error
If the signature check failed. This could be because of a time skew. Attempting to adjust the signer. [warn] getaddrinfo: address family for nodename not supported [evhttp_server.cc : 235] RAW: Entering the event loop ...Has anybody solved it ??
For me the TFServing http server starts up after that last logged line. Isn't that the case ?
where in tfserving code model from s3 is downloaded? i.e. where is this s3 connector?
make sure you set undocumented
AWS_LOG_LEVELenv var or else the TF s3 plugin will spam your logs, eg:export AWS_LOG_LEVEL=3forFATALThis would only hide the real problem in my opinion - which is querying AWS every second.
this needs to be changed.
As @shlomiken pointed out, making S3 API requests every second is an issue and can rack up a massive S3 bill. We opted to turn the feature altogether off by setting --file_system_poll_wait_second=2147483647. The --model_config_file_poll_wait_seconds can also have an impact on the number of S3 API requests. If the models are not expected to change or deploying models results in Tensorflow Serving restarting then it's also recommended to disable.
Most helpful comment
Sure, I understand that it is logging the aws failure due to looking for a new model version: however spamming a Warning and Errors on expected behaviour is not really desired, so I do think this is a bug, its just an un-handled case.
I hate to bring up an issue and not offer a viable solution... but realistically we should be suppressing those logs and pushing some debug logs specifying that a new version could not be found; rather than allowing the aws logger to propagate an intended Error log.