Sagemaker-python-sdk: RegisterModel or TensorFlowModel.register lost the inference behavior

Created on 3 Feb 2021 · 16 Comments · Source: aws/sagemaker-python-sdk

Describe the bug
I am using the template project and want to create and deploy a Keras model that has a custom preprocessing script. To do that, I put the preprocessing logic inside inference.py, as described in https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/deploying_tensorflow_serving.html. The model works as expected when I deploy it directly with TensorFlowModel, but when I deploy from the model package ARN that was registered by the pipeline, it does not behave as expected.

To reproduce
My pipeline looks like this

- step_data_ingest
- step_feature_engineering
- step_data_validation
- step_train (shown below)

    output_path = f"{experiment_package_s3_dir}/model"

    xx_estimator = TensorFlow(
        entry_point='train.py',
        source_dir='pipelines/abalone/src', # this should be just "source" for your code
        role=role,
        image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.3.1-cpu-py37-ubuntu18.04",
        instance_count=1,
        model_dir=False,
        instance_type=training_instance_type,
        output_path = output_path, # all training step output (include debug etc.)
        sagemaker_session=sagemaker_session,
        container_log_level=10, # 10 debug 20 info 30 warning 40 error
        base_job_name=f"{base_job_prefix}-model-train",
        hyperparameters={
            "epochs": 5,
            "batch_size": 256,
            "early_stop_patience": 10,
            "country": country
        }
    )
    step_train = TrainingStep(
        name="TrainxxModel",
        estimator=xx_estimator,
        inputs={
            "train": TrainingInput(
                s3_data=step_data_ingest.properties.ProcessingOutputConfig.Outputs[
                    "train"
                ].S3Output.S3Uri,
                content_type=None,
            ),
            "validation": TrainingInput(
                s3_data=step_data_ingest.properties.ProcessingOutputConfig.Outputs[
                    "validation"
                ].S3Output.S3Uri,
                content_type=None,
            ),
            "encoders": TrainingInput(
                s3_data=step_feature_engineering.properties.ProcessingOutputConfig.Outputs[
                    "encoders"
                ].S3Output.S3Uri,
                content_type=None,
            ),
        },
    )

- step_model_eval
- step_register

    step_register = RegisterModel(
        name="RegisterxxModel",
        estimator=xx_estimator,
        model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
        image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.3.1-cpu-py37-ubuntu18.04",
        content_types=["application/json", "text/csv"],
        response_types=["application/json", "text/csv"],
        inference_instances=["ml.m5.large"],
        transform_instances=["ml.m5.large"],
        model_package_group_name=model_package_group_name,
        approval_status=model_approval_status,
        source_dir='pipelines/abalone/src',
        entry_point="inference.py",
        model_metrics=salary_model_metrics,
        role=role,
    )

Inside inference.py, I do several setup steps outside of the handler function. It looks like this:

import xxx

setup encoders
load model

def handler
   process input
   model predict
   processing output

def _processing_input
    encoders do feature transform

def _processing_output
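
For readers following along, here is a minimal sketch of an inference.py with this shape. It uses the input_handler / output_handler pair described in the sagemaker-tensorflow-serving-container README rather than a single handler. The encoder mapping and the "country" and "years" fields in the payload are hypothetical stand-ins, not the original code.

```python
import json

# Module-level setup runs once, when the container imports inference.py.
# ENCODER is a hypothetical stand-in for the real fitted encoders.
ENCODER = {"US": 0, "CA": 1}
print("[inference.py] encoders loaded")


def _process_input(record):
    """Apply the (hypothetical) feature transform to one input record."""
    return [ENCODER.get(record["country"], -1), float(record["years"])]


def input_handler(data, context):
    """Turn the incoming request body into a TensorFlow Serving request."""
    if context.request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {context.request_content_type}")
    records = json.loads(data.read().decode("utf-8"))
    return json.dumps({"instances": [_process_input(r) for r in records]})


def output_handler(response, context):
    """Pass the TensorFlow Serving response back to the caller."""
    if response.status_code != 200:
        raise ValueError(response.content.decode("utf-8"))
    return response.content, context.accept_header
```

If the startup print never appears in the endpoint logs, the script was probably not loaded at all, which would match the symptom described for B and C below.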

Expected behavior
The registered model should take input, log and run the feature transform, and return a proper prediction after being deployed.

Screenshots or logs

If I deploy the model this way, it works:

## A. The working one
xx_model = TensorFlowModel(
    name="xxDebugModel",
    entry_point="inference.py",
    source_dir="../pipelines/abalone/src",
    image_uri=image_uri,
    model_data=model_data,
    sagemaker_session=sagemaker.Session(),
    container_log_level=0,
    role=role,
)
display(xx_model.__dict__)

The following ways don't work:

## B. deploy from package created from xx_model.register
model_package = xx_model.register(
    content_types=["application/json", "text/csv"],
    response_types=["application/json", "text/csv"],
    inference_instances=["ml.m5.large"],
    transform_instances=["ml.m5.large"],
    model_package_group_name="xxPipeModelGroup",
    approval_status="Approved",
    model_metrics = xx_model_metrics,
    description=f"Registered from XX Model Experiement Pack: {experiment_package_s3_dir}"
)
model_package.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="xxDebugModelPackDeploy"
)

## C. deploy from the package registered by the pipeline; these two methods show the same behavior
model_package_from_arn = sagemaker.ModelPackage(
    role=role,
    model_package_arn = "arn:aws:sagemaker:us-east-1:104436464649:model-package/xxx-us/5"
)
model_package_from_arn.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="xxDebugModelPackARNDeploy"
)

A shows logs like this

// [from tensorflow serving] loading model
// [from inference.py] load encoders
// [from inference.py] setting up
// [from tensorflow serving] entering event loop (ping...ping...)

B and C show similar logs

// [from tensorflow serving] loading model
// [from tensorflow serving] entering event loop (ping...ping...)

System information
A description of your system. Please provide:

SageMaker Python SDK version:
Framework name (eg. PyTorch) or algorithm (eg. KMeans): Tensorflow
Framework version: 2.3
Python version: 3.7
CPU or GPU: CPU
Custom Docker image (Y/N): N

Additional context

For privacy, I masked the variable names, so there may be typos, but you can assume all of them are correct.
I don't know how to find the logs from when the instance starts up; if someone can help, I can grab those logs.

Other issues found:

I can't create a CreateModelStep from a TensorFlowModel; it raises an error about S3ModelArtifacts.
When I deploy the model using method A, I have to do it twice: the first attempt returns an error saying I used duplicate tags for sagemaker-project (or something similar), and I can't suppress it.

bug documentation

ruyyi0323

All 16 comments

Can someone help?

ruyyi0323 on 2 Mar 2021

Hey @ruyyi0323,

Apologies for the late response.

For clarification purposes, are you following https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html for setting up your pipeline for registering and deploying your Keras model?

ChoiByungWook on 2 Mar 2021

Hi @ChoiByungWook

Thanks for helping!
I am doing it this way:

# A.
salary_model = TensorFlowModel(
    name="SalaryDebugModel",
    entry_point="inference.py",
    source_dir=base_dir,
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.3.1-cpu-py37-ubuntu18.04",
    model_data=model_data,
    sagemaker_session=sagemaker.Session(),
    container_log_level=10,
    role=role,
)
display(salary_model.__dict__)

salary_model.deploy(
    instance_type="ml.m5.2xlarge", 
    initial_instance_count=1, 
    endpoint_name="SalaryDebugModel", 
    update_endpoint=True, 
    tags=None
)

# B.
model_package = None
model_package = salary_model.register(
    content_types=["application/json", "text/csv"],
    response_types=["application/json", "text/csv"],
    inference_instances=["ml.m5.4xlarge"],
    transform_instances=["ml.m5.4xlarge"],
    model_package_group_name="SalaryDebugModelGroup",
    approval_status="Approved",
    description="DEBUG"
)
model_package.__dict__
model_package.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.4xlarge",
    endpoint_name="SalaryDebugModelPackARNDeploy"
)

A works perfectly fine and B does not; it seems B somehow didn't load the entry point.

B doesn't have the log lines in the red box, and prediction does not work; A gives correct predictions.

Plus: B hits the duplicate tag issue; I have to run the same cell twice before the model gets deployed.

ruyyi0323 on 2 Mar 2021

@ruyyi0323,

I apologize for the bad experience.

I believe you are running into these issues because the SageMaker deep learning framework containers dynamically load user scripts (inference.py) in conjunction with environment variables mapping to an S3 object. Your model package gets pulled from S3 and your inference.py file gets packed into a new tar file, as shown here: https://github.com/aws/sagemaker-python-sdk/blob/e08c04e6ed0fdfb7e9e873d119769509f3ed74de/src/sagemaker/utils.py#L362. Thus, in your A version of the deployment, this all passes, as inference.py is inside the container as expected: https://github.com/aws/sagemaker-tensorflow-serving-container#prepost-processing.

I believe if you attempt to use the repacked model data of the successfully deployed A version model in your registration it should work.

# Version A
xx_model = TensorFlowModel()
xx_model.deploy()
repacked_model_data = xx_model.repacked_model_data

# Version B
xx_model.model_data = repacked_model_data
xx_model.register()
...

# Create new Model object
new_xx_model = TensorFlowModel(model_data=repacked_model_data, ...)
new_xx_model.register()

If you don't like that approach, you can also attempt to repack the model locally and create the new model object as shown above. You would need to follow the directory layout shown here:

// Example, with tar file called model.tar.gz
model.tar.gz
>>> model
>>>     code |------ inference.py
>>>     |------ saved_model.pb
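
A minimal sketch of repacking locally with Python's standard tarfile module. It assumes the layout reported for the working deployment elsewhere in this thread (code/inference.py next to a numbered model directory). The function name, argument names, and defaults are hypothetical, and uploading the result to S3 is left out.

```python
import tarfile
from pathlib import Path


def repack_locally(saved_model_dir, code_dir, out_path, model_name="model", version="1"):
    """Write a tar.gz holding <model_name>/<version>/... and code/...

    All names here are illustrative; adjust the layout if your container
    expects a different one.
    """
    with tarfile.open(out_path, "w:gz") as tar:
        # SavedModel assets (saved_model.pb, variables/, ...) go under a version folder.
        tar.add(str(saved_model_dir), arcname=f"{model_name}/{version}")
        # inference.py and any helper files go under code/.
        tar.add(str(code_dir), arcname="code")
    return Path(out_path)
```

The resulting file would then be uploaded to S3 and passed as model_data when constructing the new model object.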

If you would prefer baking the scripts into the image instead of loading them from your model package, you would have to either extend an existing Docker image or commit a new one with your inference.py script baked in.

Quick overview for creating a new algorithm container in SageMaker: https://github.com/aws/amazon-sagemaker-examples/blob/master/aws_marketplace/creating_marketplace_products/Bring_Your_Own-Creating_Algorithm_and_Model_Package.ipynb

For extending an existing image see: https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/pytorch_extending_our_containers/pytorch_extending_our_containers.ipynb

The image to extend would be: 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.3.1-cpu-py37-ubuntu18.04

ChoiByungWook on 2 Mar 2021

Hi @ChoiByungWook ,

Thanks for the explanation. I am currently trying out the SageMaker workflow pipeline, so I am trying my best to mimic the repack-and-register behavior to see why the RegisterStep failed to build a model with the corresponding prediction behavior. That is method C from my original post.

Does the utils.py function you mentioned basically replicate the repack model step? If so, I can try that out.

ruyyi0323 on 2 Mar 2021

Yes, you should be able to call the repack_model function, however it does have a few parameters. When you do deploy with the TensorFlowModel object it calls that as well: https://github.com/aws/sagemaker-python-sdk/blob/e08c04e6ed0fdfb7e9e873d119769509f3ed74de/src/sagemaker/tensorflow/model.py#L320-L328

ChoiByungWook on 2 Mar 2021

Thanks so much, will do a quick PoC tomorrow

ruyyi0323 on 2 Mar 2021

Awesome! For reference, feel free to pull down the model object that is associated with your successful endpoint (version A) in the AWS console. When you download the tar file from S3, it should show you your pb model and inference.py inside.

AWS Console -> SageMaker -> Endpoint -> Endpoint configuration settings -> Model Name -> Container 1 -> Model data location

Or you can check the repacked_model_data attribute locally.

# Version A
print(xx_model.repacked_model_data)
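
Once the tar file is downloaded, a quick local check can confirm whether the script made it into the archive. This is a small sketch using only the standard library; the function name and the file name in the usage note are hypothetical.

```python
import tarfile


def list_inference_scripts(tar_path):
    """Return archive members named inference.py, with any leading './' removed."""
    with tarfile.open(tar_path, "r:gz") as tar:
        names = [n[2:] if n.startswith("./") else n for n in tar.getnames()]
    return [n for n in names if n.split("/")[-1] == "inference.py"]
```

For example, calling list_inference_scripts("model.tar.gz") on the downloaded file and getting an empty list back would suggest the model data was packaged without the script.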

ChoiByungWook on 2 Mar 2021

Hi @ChoiByungWook, I am trying to inspect repacked_model_data after calling model.deploy(..), but I am getting NoneType. Is this a bug?

ruyyi0323 on 2 Mar 2021

Hey @ruyyi0323,

My apologies.

It looks like we have some inconsistency with how we track the repacked_model_data within the base Model class: https://github.com/aws/sagemaker-python-sdk/blob/c4d71f5639697d833e5578016a3f45402a4a80de/src/sagemaker/model.py#L1131
and the TensorFlowModel class: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/model.py#L294-L332

I believe to find the repacked model you will need to look at the model defined in your endpoint in the AWS Console.

ChoiByungWook on 2 Mar 2021

Hi @ChoiByungWook, I have tried a couple of experiments to see what's going on. I'm not sure whether this helps you or anyone else, but I will put some information here for reference.

Illustration of whole PoC

[Success] A1. create TensorFlowModel object using the model.tar.gz from TrainingStep output and deploy

This one works. Checking the file structure after deploying, I have:

./code/inference.py
./code/{other files}
./model/1/{model_assets, including pb or whatever}

salary_model = TensorFlowModel(
    name="SalaryDebugModel",
    entry_point="inference.py",
    source_dir=base_dir,
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.3.1-cpu-py37-ubuntu18.04",
    model_data=model_data,
    sagemaker_session=sagemaker.Session(),
    container_log_level=10,
    role=role,
)
salary_model.deploy(..)



from sagemaker.utils import (
    repack_model
)
repacked_salary_model = TensorFlowModel(
    name="SalaryDebugModelRepack",
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.3.1-cpu-py37-ubuntu18.04",
    model_data=salary_model.repacked_model_data,
    sagemaker_session=sagemaker.Session(),
    container_log_level=10,
    role=role
)
repacked_salary_model.deploy(..)



repacked_salary_model = TensorFlowModel(
    name="SalaryDebugModelRepack",
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.3.1-cpu-py37-ubuntu18.04",
    model_data="s3://sagemaker-us-east-1-104436464649/SalaryDebugModel/model.tar.gz", # repacked file
    sagemaker_session=sagemaker.Session(),
    container_log_level=10,
    role=role
)
repacked_salary_model.deploy(..)



model_package_1 = salary_model.register(
    content_types=["application/json", "text/csv"],
    response_types=["application/json", "text/csv"],
    inference_instances=["ml.m5.4xlarge"],
    transform_instances=["ml.m5.4xlarge"],
    model_package_group_name="SalaryDebugModelGroup",
    approval_status="Approved",
    description="DEBUG"
)
model_package_1.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.4xlarge",
    endpoint_name="SalaryDebugModelPackARNDeploy",
    wait=False
)



model_package_1 = salary_model.register(
    content_types=["application/json", "text/csv"],
    response_types=["application/json", "text/csv"],
    inference_instances=["ml.m5.4xlarge"],
    transform_instances=["ml.m5.4xlarge"],
    model_package_group_name="SalaryDebugModelGroup",
    approval_status="Approved",
    description="DEBUG"
)
model_package_1.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.4xlarge",
    endpoint_name="SalaryDebugModelPackARNDeploy",
    wait=False
)



./code/{both train and inference scripts}
./code/{other scripts}
./{model_assets}
./{other reference files}



salary_estimator = TensorFlow(
        entry_point='train.py',
        source_dir="pipelines/abalone/src", # this should be just "source" for your code
        role=role,
        image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04",
        instance_count=1,
        instance_type=training_instance_type,
        output_path = get_projection_s3_dir(experiment_name, "model"),
        model_dir = False,
        sagemaker_session=sagemaker_session,
        container_log_level=10, # 10 debug 20 info 30 warning 40 error
        volume_size=160,
        base_job_name=f"{base_job_prefix}-model-train",
        hyperparameters={
            "epochs": train_epochs,
            "batch_size":train_batch_size,
            "early_stop_patience": early_stop_tolerance,
        },
)

# trainstep

step_register = RegisterModel(
        name="RegisterSalaryModel",
        estimator=salary_estimator,
        model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
        image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.3.1-cpu-py37-ubuntu18.04",
        content_types=["application/json", "text/csv"],
        response_types=["application/json", "text/csv"],
        inference_instances=_available_inference_instances,
        transform_instances=_available_transform_instances,
        model_package_group_name=model_package_group_name,
        approval_status=model_approval_status,
        model_metrics=salary_model_metrics,
        role=role,
    )

Summary

My theory is that the _repackModel step somehow repacks the model incorrectly, which makes the deployment lose the inference.py behavior.
As a workaround:
- change the trainstep to force-copy the inference code into the /opt/ml/model/code folder (this assumes src contains both the training code and the inference code)
- change the RegisterStep: DO NOT pass entry_point and source_dir, so that it won't trigger the SageMaker repacking.
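
The first workaround could be sketched as a small helper called at the end of train.py, after the model has been saved. This is an illustration under assumptions, not the code used in the pipeline above: the function name and the default file list are hypothetical, and it relies on the SM_MODEL_DIR environment variable that SageMaker training containers set, falling back to /opt/ml/model.

```python
import os
import shutil
from pathlib import Path


def copy_inference_code(src_dir, model_dir=None, files=("inference.py",)):
    """Copy the listed files from src_dir into <model_dir>/code.

    model_dir defaults to SM_MODEL_DIR, falling back to /opt/ml/model.
    """
    model_dir = Path(model_dir or os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    code_dir = model_dir / "code"
    code_dir.mkdir(parents=True, exist_ok=True)
    for name in files:
        # copy2 keeps file metadata; any missing source file raises FileNotFoundError.
        shutil.copy2(Path(src_dir) / name, code_dir / name)
    return code_dir
```

Inside train.py this might be invoked as copy_inference_code(os.path.dirname(os.path.abspath(__file__))), so that the training job's model.tar.gz already contains code/inference.py and no repacking is needed at registration time.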

Other Issues Came Across

Duplicate tag issues (happens nearly every time I call the .deploy function, and I have no idea how to force a name or remove the tags)
No TensorFlow issue for inference.py (I am importing tensorflow to do some other things)

ruyyi0323 on 3 Mar 2021

@ruyyi0323,

Thank you so much for all of this!

For option A2, if you're calling repack_model directly, you can use the parameter you passed in for repacked_model_uri as your model_data parameter in your TensorFlowModel constructor.

ChoiByungWook on 3 Mar 2021

Thanks @ChoiByungWook. I haven't tried that out yet, but I believe that would work too.

ruyyi0323 on 3 Mar 2021

Hello. I tried to follow your _[Success] B2_ recipe and it did not work for me. The estimator has both entry_point and source_dir parameters. The source dir contains both training and inference py files. The RegisterModel step uses both the estimator and the training S3 model artefact. However, this resulting S3 model artefact has no code source dir with inference.py. It was not packaged into the model during the training step. Do I need a model step before registering? When does repackaging happen?


tf_estimator = TensorFlow(
    entry_point='tf_train.py',
    source_dir='code', 
    role=role,
    framework_version='2.4.1',
    model_dir=False,
    py_version='py37',
    instance_type='ml.m5.large',
    instance_count=1,
    output_path=output_path,
)

register_step = RegisterModel(
    name="RegisterModel",
    estimator=tf_estimator,
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
    image_uri=f"763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:2.3-cpu",
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=["ml.t2.medium", "ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name=model_name_param,
)

ievgen-goichuk-rft on 22 Apr 2021

Hi,

Yeah, you need to add that part in your training step, or start another training job to pack your model. The model artifact file that you use for the final registration should contain both the code files and the model files.

ruyyi0323 on 22 Apr 2021

From the given code snippet, if you look at the model.tar.gz from output_path or from your TrainStep output, you should be able to see code+model file in your model.tar.gz.
