Sagemaker-python-sdk: TensorFlowModel points to images that do not exist

Created on 7 Jul 2019 · 16 Comments · Source: aws/sagemaker-python-sdk

System Information

TensorFlow
Fails for all versions
Fails for py3 and py2
Fails for CPU and GPU
No custom image

Describe the problem

If I create a model from pre-built artifacts like so:

```{Python}
from sagemaker.tensorflow.model import TensorFlowModel

sagemaker_model = TensorFlowModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model0100.tar.gz',
                                  role=role,
                                  framework_version='1.13', py_version='py3',
                                  entry_point='train.py')
```

it fails upon deploying:

```{Python}
predictor = sagemaker_model.deploy(initial_instance_count=1,
                                   instance_type='ml.p2.xlarge')
```

I receive:

```
ValueError: Error hosting endpoint sagemaker-tensorflow-2019-07-07-11-50-45-473: Failed Reason: The image '520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.13-gpu-py3' does not exist.
```

I can get past this error by specifying the image explicitly (this is not well documented; it took a lot of digging to find a link that worked):

```{Python}
sagemaker_model = TensorFlowModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model0100.tar.gz',
                                  role=role,
                                  framework_version='1.13', py_version='py3',
                                  entry_point='train.py',
                                  image='763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:1.13-gpu')
```

Any idea how to solve this?
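
For reference, the two image URIs in this report differ in account ID, repository name, and tag layout. The sketch below only rebuilds those two string patterns as they appear in the messages above. The function names are hypothetical, and this is not the SDK's own lookup logic.

```python
# Hypothetical helpers that rebuild the two URI patterns quoted in this issue.
# Account IDs, repository names and tag layouts are copied from the report above;
# nothing here queries ECR or calls the SageMaker SDK.


def default_image_uri(region, framework_version, device, py_version):
    """URI pattern from the error message (the image that does not exist)."""
    return (
        f"520713654638.dkr.ecr.{region}.amazonaws.com/"
        f"sagemaker-tensorflow:{framework_version}-{device}-{py_version}"
    )


def inference_image_uri(region, framework_version, device):
    """URI pattern of the image passed explicitly as a workaround."""
    return (
        f"763104351884.dkr.ecr.{region}.amazonaws.com/"
        f"tensorflow-inference:{framework_version}-{device}"
    )


print(default_image_uri("eu-west-1", "1.13", "gpu", "py3"))
print(inference_image_uri("eu-west-1", "1.13", "gpu"))
```

Note that the workaround tag carries no Python version suffix, while the default tag does.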

documentation question

NoahDolev

👍5

Most helpful comment

Just some context.

There are two TensorFlow solutions that handle serving in the Python SDK.

They have different class representations and documentation as shown here.

1. TensorFlowModel - https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/model.py#L47
   - Doc: https://github.com/aws/sagemaker-python-sdk/tree/v1.12.0/src/sagemaker/tensorflow#deploying-directly-from-model-artifacts
   - Key difference: uses a proxy gRPC client to send requests
   - Container impl: https://github.com/aws/sagemaker-tensorflow-container/blob/master/src/tf_container/serve.py
2. Model - https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/serving.py#L96
   - Doc: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst
   - Key difference: uses the TensorFlow Serving REST API
   - Container impl: https://github.com/aws/sagemaker-tensorflow-serving-container/blob/master/container/sagemaker/serve.py

Python 3 isn't supported with the TensorFlowModel object. Its container uses the TensorFlow Serving API library together with the gRPC client to handle inference, and the TensorFlow Serving API isn't officially supported in Python 3. As a result, only Python 2 versions of the containers exist for the TensorFlowModel object.

If you need Python 3, you will need to use the Model object (sagemaker.tensorflow.serving.Model) defined in #2 above. The inference script format will change if you need to handle pre- and post-processing: https://github.com/aws/sagemaker-tensorflow-serving-container#prepost-processing
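
As a rough illustration of that script format, here is a minimal sketch of an inference script, assuming the `input_handler` / `output_handler` convention described in the linked README. The pass-through logic is only an example, and the exact attributes available on `context` should be checked against that README.

```python
# Sketch of an inference script for the TensorFlow Serving container.
# Assumption: the container calls input_handler before forwarding the request
# to TensorFlow Serving and output_handler on the response, as in the README.


def input_handler(data, context):
    """Pre-process the incoming request body (a file-like object)."""
    if context.request_content_type == "application/json":
        # Pass JSON through unchanged; custom pre-processing would go here.
        return data.read().decode("utf-8")
    raise ValueError(f"Unsupported content type: {context.request_content_type}")


def output_handler(response, context):
    """Post-process the TensorFlow Serving response before returning it."""
    if response.status_code != 200:
        raise ValueError(response.content.decode("utf-8"))
    # Return the raw prediction bytes together with the response content type.
    return response.content, context.accept_header
```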

Your inference requests will also need to follow the TensorFlow Serving (TFS) REST API:
https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#making-predictions-against-a-sagemaker-endpoint
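
To make the request shape concrete, here is a small sketch of the two JSON body layouts the TFS REST predict API accepts (row format under `instances`, columnar format under `inputs`). The helper names and the input tensor name `x` are made up for illustration; the real input names depend on your model's signature.

```python
import json


def row_format_request(instances):
    """Row format: one entry per example under the "instances" key."""
    return json.dumps({"instances": instances})


def columnar_format_request(inputs):
    """Columnar format: named input tensors under the "inputs" key."""
    return json.dumps({"inputs": inputs})


# "x" is a placeholder tensor name, not something defined by this thread.
print(row_format_request([[1.0, 2.0, 5.0]]))
print(columnar_format_request({"x": [[1.0, 2.0, 5.0]]}))
```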

Since you train externally, you will need to make sure your model artifacts follow the correct format: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#deploying-more-than-one-model-to-your-endpoint
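
A minimal sketch for checking an archive before uploading it, assuming the `<model_name>/<version_number>/saved_model.pb` layout described in the linked documentation. `find_saved_models` is a hypothetical helper, not part of the SDK, and the layout assumption should be verified against the docs for your container version.

```python
import re
import tarfile

# Assumed layout inside model.tar.gz: <model_name>/<numeric_version>/saved_model.pb
SAVED_MODEL_PATTERN = re.compile(r"^(?:\./)?[^/]+/\d+/saved_model\.pb$")


def find_saved_models(fileobj):
    """Return archive members that match the assumed SavedModel layout."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        return [name for name in archive.getnames() if SAVED_MODEL_PATTERN.match(name)]


# Usage (path is illustrative):
# with open("model.tar.gz", "rb") as f:
#     print(find_saved_models(f))
```

An empty result suggests the archive was packed from the wrong directory level.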

Here is an example that does, for the most part, what you're trying to do: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_serving_container/tensorflow_serving_container.ipynb

Sorry for the confusion and wall of text and links. Please let me know if there is anything I can clarify.

Thanks!

ChoiByungWook on 10 Jul 2019

👍8 🎉1

All 16 comments

Hi @NoahDolev, thank you for using SageMaker! From the code you provided, it seems you want to train your model with train.py?

In order to use TensorFlow script mode to train your model (and then deploy), you want to start with the Tensorflow Estimator class: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/estimator.py#L188

Set either script_mode=True or py_version="py3" to enable script mode.

otter-bunny on 8 Jul 2019

Hi @ChuyangDeng ,

I am not sure that has anything to do with the issue I posted. I am reporting to you that the docker image which SageMaker searches for by default is not correct for eu-west-1. Also, script_mode is not a valid flag of TensorFlowModel. This flag exists only in TensorFlow to the best of my knowledge.

Best,
Noah

NoahDolev on 9 Jul 2019

Hi @NoahDolev,

Are you trying to do training or hosting here? Our TensorFlow script mode is only supported for training, while the TensorFlowModel class is for hosting. That's why the Docker image URI is not correct (the image cannot be found).

If you are training your model, you should use the TensorFlow estimator class so that you can train with our script mode image.

If you are deploying your trained model, you will use the TensorFlowModel class, but script mode is not supported for deployment.

otter-bunny on 9 Jul 2019

@NoahDolev @ChuyangDeng I hit the same error when I followed this link:
https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker/
to deploy a pre-trained model in SageMaker, using a different model. Since I am using py3 in my model, I have to specify the image like this:

`sagemaker_model = TensorFlowModel(model_data = 's3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
                                   role = role,
                                   py_version='py3',
                                   framework_version = '1.12',
                                   entry_point = 'train.py')

predictor = sagemaker_model.deploy(initial_instance_count=1,
                                   instance_type='ml.p2.xlarge')`

ValueError: Error hosting endpoint sagemaker-tensorflow-2019-07-10-05-06-02-075: Failed Reason: The image '520713654638.dkr.ecr.us-east-2.amazonaws.com/sagemaker-tensorflow:1.12-gpu-py3' does not exist.

When I delete py_version='py3' there is no error anymore.

yuchuang1979 on 10 Jul 2019

👍1

Hi @yuchuang1979 ,

Precisely what I am referring to. I am trying to deploy a model I trained elsewhere. You can also specify the image to solve the problem. My point, however, is that the default is pointing to the wrong docker image. It's a bug.

Best,
Noah

NoahDolev on 10 Jul 2019

😕1

@NoahDolev thanks for pointing out that there is another route by specifying the image. I am totally new to SageMaker and just began the work several days ago.

How could you create the image before specifying it in the function?

yuchuang1979 on 10 Jul 2019

@ChoiByungWook This is quite clear. Thanks!

yuchuang1979 on 11 Jul 2019

@ChoiByungWook Thanks for the explanation! When will TF 1.14 be supported for serving?

I tried the CPU, GPU, and elastic variants, but none of the corresponding images seem to be available:

The image '763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:1.14-cpu' does not exist.

The image '763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:1.14-gpu' does not exist.

I used the second class you listed (Model):

from sagemaker import get_execution_role
from sagemaker.tensorflow.serving import Model
role = get_execution_role()

sagemaker_model = Model(model_data = 's3://sagemaker-hover/Models/zulu/tpu/model.tar.gz',
                        role = role,
                        framework_version='1.14')
predictor = sagemaker_model.deploy(initial_instance_count=1, 
                                   instance_type='ml.p2.xlarge',
                                   endpoint_name='test-001')

Also, the TensorFlowModel class seems to support versions only up to TF 1.12.

panfeng-hover on 24 Jul 2019

👍1

We have to use the proxy server with circle to run this.

tomislavmitic2012 on 21 Jan 2020

Did the format for specifying images change after TensorFlow 2 support was added? Or are there just no pre-built images for TensorFlow frameworks 2.0 and 2.1? I get

UnexpectedStatusException: Error hosting endpoint sagemaker-tensorflow-2020-04-13-14-02-35-992: Failed. Reason:  The image '520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tensorflow:2.1.0-cpu-py2' does not exist..

UnexpectedStatusException: Error hosting endpoint sagemaker-tensorflow-2020-04-13-14-02-35-992: Failed. Reason:  The image '520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tensorflow:2.1.0-gpu-py2' does not exist..

UnexpectedStatusException: Error hosting endpoint sagemaker-tensorflow-2020-04-13-14-02-35-992: Failed. Reason:  The image '520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tensorflow:2.1.0-cpu-py3' does not exist..

UnexpectedStatusException: Error hosting endpoint sagemaker-tensorflow-2020-04-13-14-02-35-992: Failed. Reason:  The image '520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tensorflow:2.1.0-gpu-py3' does not exist..

When trying to specify

from sagemaker.tensorflow.model import TensorFlowModel
sagemaker_model = TensorFlowModel(model_data = 's3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
                                  role = role,
                                  framework_version = '2.1.0',
                                  entry_point = 'train.py')

in the sample notebook available at https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker/.
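
The failure reasons quoted in this thread all share one shape, so the image URI can be pulled apart to see which account, region, repository, and tag were requested. A minimal sketch, assuming the message format shown above; `parse_missing_image` is a hypothetical helper, not an SDK function.

```python
import re

# Matches the "The image '<account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>' does not exist"
# fragment seen in the errors quoted in this thread.
IMAGE_PATTERN = re.compile(
    r"The image '(?P<account>\d+)\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:']+):(?P<tag>[^']+)' does not exist"
)


def parse_missing_image(message):
    """Return the URI components from a failure reason, or None if it does not match."""
    match = IMAGE_PATTERN.search(message)
    return match.groupdict() if match else None
```

Comparing the parsed repository and tag against the images that do exist makes it easier to tell a wrong default URI from a genuinely unsupported framework version.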

KeeleyDonovan on 13 Apr 2020

@ChoiByungWook The container implementation code locations given above (for TensorFlowModel and Model) are outdated. Can you please point to the current implementations?

ratulray on 6 May 2020

@keelerh @ratulray I believe the class you're looking for is sagemaker.tensorflow.serving.Model (the second one that @ChoiByungWook mentioned): https://sagemaker.readthedocs.io/en/stable/sagemaker.tensorflow.html#tensorflow-serving-model. That class should retrieve the correct image URI for the TF 2.x images.

If you have any further questions, please open a new issue (it'll help with our internal tracking).