Sagemaker-python-sdk: PyTorch 1.6.0 Inference packaging skips dependencies, other model artefacts

Created on 19 Sep 2020 · 11 Comments · Source: aws/sagemaker-python-sdk

Describe the bug
Model artefacts packaged in model.tar.gz are skipped when the model object is converted into a TorchServe model. Included dependencies are dropped as well.

To reproduce
Add any extra file to model.tar.gz that is not model.pth, and it won't show up in the container. Similarly, any extra dependencies specified when initializing a PyTorchModel object are dropped.
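A minimal sketch of the reproduction setup, assuming placeholder file names (model.pth, another_model.pkl) and dummy contents. The helper name build_model_archive is hypothetical; it only builds a local archive with one extra artefact next to the weights and does not call SageMaker.

```python
import os
import tarfile
import tempfile


def build_model_archive(work_dir, extra_name="another_model.pkl"):
    """Create model.tar.gz containing model.pth plus one extra artefact."""
    names = ["model.pth", extra_name]
    for name in names:
        # Dummy bytes stand in for real weights / pickled objects.
        with open(os.path.join(work_dir, name), "wb") as f:
            f.write(b"placeholder")

    archive_path = os.path.join(work_dir, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in names:
            # arcname keeps the files at the top level of the archive.
            tar.add(os.path.join(work_dir, name), arcname=name)
    return archive_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = build_model_archive(tmp)
        with tarfile.open(path, "r:gz") as tar:
            print(sorted(tar.getnames()))
```

Uploading such an archive and passing it as model_data should be enough to check whether the extra file reaches the container.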

Expected behavior
Model package would include all of the extra files and dependencies, as described in the API documentation.

Screenshots or logs
I ran os.walk on model_dir and found that artefacts I was expecting were missing. The log below shows the files in model_dir. I had packaged an extra .pkl object, which is not among them.

2020-09-18 23:50:40,059 [WARN ] W-9000-model_1-stderr org.pytorch.serve.wlm.WorkerLifeCycle - 2020-09-18 23:50:40.058 | INFO | inference:model_fn:59 - ['inference.py', 'handler_service.py', 'model.pth']
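The kind of check that produces a listing like the one above can be sketched as follows. This is an illustration rather than the reporter's actual inference.py: the helper list_model_dir is a hypothetical name, and print is used in place of whatever logger the original script used.

```python
import os


def list_model_dir(model_dir):
    """Return (directories, files) found anywhere under model_dir, as relative paths."""
    dirs, files = [], []
    for root, dirnames, filenames in os.walk(model_dir):
        rel_root = os.path.relpath(root, model_dir)
        for d in dirnames:
            dirs.append(os.path.normpath(os.path.join(rel_root, d)))
        for f in filenames:
            files.append(os.path.normpath(os.path.join(rel_root, f)))
    return sorted(dirs), sorted(files)


def model_fn(model_dir):
    # Log what the container actually received before loading anything.
    dirs, files = list_model_dir(model_dir)
    print("directories:", dirs)
    print("files:", files)
    # Real model loading would go here; it is omitted in this sketch.
    return None
```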

Similarly, the directory listing is missing the extra dependencies specified when creating the PyTorchModel object. They appear neither under model_dir/lib (see #1832, a separate bug) nor under model_dir (where the API documentation says they should be).

2020-09-18 23:50:40,058 [WARN ] W-9000-model_1-stderr org.pytorch.serve.wlm.WorkerLifeCycle - 2020-09-18 23:50:40.058 | INFO | inference:model_fn:58 - ['__pycache__', 'MAR-INF']

System information

SageMaker Python SDK version: 2.9.1
Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
Framework version: 1.6.0
Python version: 3.6
CPU or GPU: GPU
Custom Docker image (Y/N): N

Additional context

The problem arises from a line in the TorchServe packaging step that was introduced by https://github.com/aws/sagemaker-pytorch-inference-toolkit/pull/79/. During repackaging, it drops every object except the inference script.

Clearly, that is a regression and unexpected.

bug

setu4993

👍1

All 11 comments

Hi @setu4993,

Could you provide the Python SDK code and the source_dir structure so we can root-cause the issue?

otter-bunny on 21 Sep 2020

A short example to reproduce the issue (X and Y are other directories on the file system):

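from sagemaker.pytorch import PyTorchModel

# Note: the `role` argument (IAM role ARN) that PyTorchModel expects is omitted here, as in the original report.
sagemaker_model = PyTorchModel(
    model_data="model.tar.gz",
    source_dir="/root/",
    entry_point="inference.py",
    # Extra local directories that should be shipped alongside the entry point.
    dependencies=[
        "X",
        "Y",
    ],
    framework_version="1.6.0",
    py_version="py3",
)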

After the repackaging step, the model.tar.gz created on S3 at the code_location has this layout:

├── code
│   ├── README.md
│   ├── inference.py
│   ├── lib
│   │   ├── X
│   │   │   ├── __init__.py
│   │   │   └── ...
│   │   └── Y
│   │       ├── README.md
│   │       ├── __init__.py
│   │       └── ...
│   └── requirements.txt
├── model.pth
└── another_model.pkl

But at runtime, the only files in the container's model_dir are inference.py, model.pth and handler_service.py (see the logs in the opening comment).
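To make the gap between the S3 archive and the container concrete, one could compare the archive's members with the file names logged at runtime. The sketch below is an assumption-laden illustration: missing_at_runtime is a hypothetical helper, and it matches on base names only, since the container appears to flatten inference.py to the top level.

```python
import posixpath
import tarfile


def missing_at_runtime(archive_path, runtime_files):
    """Return archive file members whose base name is absent from runtime_files."""
    with tarfile.open(archive_path, "r:gz") as tar:
        # Only regular files are compared; directory entries are skipped.
        packaged = [m.name for m in tar.getmembers() if m.isfile()]
    present = set(runtime_files)
    return sorted(p for p in packaged if posixpath.basename(p) not in present)
```

Calling it with a local copy of the repackaged model.tar.gz and the list taken from the log, for example ["inference.py", "handler_service.py", "model.pth"], would return the paths that never reached model_dir.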

setu4993 on 21 Sep 2020

The problem here is multi-fold:

1. Why does an extra repackaging step occur, creating a difference between what is made available on S3 at code_location and what ends up in the final inference service? This makes it harder to understand and diagnose what is happening in the container, since the model tarball SageMaker actually serves drifts from the expected one (which was itself repackaged by the SDK!).
2. Why is everything else being dropped from the container?

setu4993 on 21 Sep 2020

👍1

@ChuyangDeng : Any update on this?

setu4993 on 24 Sep 2020

@setu4993, I have experienced similar issues with that. From the source code, I believe the following line might be the one that's causing what you are experiencing (and me as well):

https://github.com/aws/sagemaker-pytorch-inference-toolkit/blob/9a6869ea3af1ebf9292da0a8d752b0c3389ecdec/src/sagemaker_pytorch_serving_container/torchserve.py#L125

This makes inference.py the only (custom) py file in the model directory. I believe this container should support adding extra/custom artifacts (e.g., model companion objects, model architecture).

Furthermore, the SageMaker Python SDK example for PyTorch (with the code here) does not work with this serving container, because the model has a companion object.

jonsnowseven on 14 Oct 2020

👍5

Thanks @jonsnowseven. +1, I think I found the same line earlier :).

In my case, everything works with 1.5.0, so I'm sticking with it and upgrading to 1.6 through requirements.txt.

setu4993 on 14 Oct 2020

Almost 4 months after the last comment, this bug still persists.

ldong87 on 5 Feb 2021

Yes @ldong87. Any news regarding this?

jonsnowseven on 5 Feb 2021

Same issue here. Have been trying to work around this for 2 days now with no luck.

dectl on 18 Feb 2021

Wanted to share that @amaharek found a potential solution and posted it on the supplementary issue on the sagemaker-pytorch-inference-toolkit repo here: https://github.com/aws/sagemaker-pytorch-inference-toolkit/issues/85#issuecomment-780621393

setu4993 on 18 Feb 2021

👍1

@dectl One (ugly) workaround is to copy or load the missing model files from /opt/ml/model/, which is apparently the documented location of the decompressed model: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-code-load-artifacts
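A rough sketch of that workaround, assuming the extra artefacts really are present under /opt/ml/model in the running container (not verified here). The helper name resolve_artifact and the file name in the usage note are hypothetical.

```python
import os

# Assumed location of the decompressed model.tar.gz, per the AWS docs linked above.
DEFAULT_ARTIFACT_DIR = "/opt/ml/model"


def resolve_artifact(name, model_dir, fallback_dir=DEFAULT_ARTIFACT_DIR):
    """Return the first existing path for `name`, checking model_dir, then fallback_dir."""
    for base in (model_dir, fallback_dir):
        candidate = os.path.join(base, name)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(
        f"{name!r} not found in {model_dir!r} or {fallback_dir!r}"
    )
```

Inside model_fn, something like resolve_artifact("another_model.pkl", model_dir) could then be used in place of a hard-coded path, so the script keeps working if a later container version stops dropping the file.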