Describe the bug
Model artefacts packaged in model.tar.gz are skipped when the model object is converted into a TorchServe model. Similarly, any included dependencies are dropped.
To reproduce
Add any extra file to model.tar.gz that is not model.pth, and it won't show up in the container. Similarly, any extra dependencies specified during the initialization of a PyTorchModel object are dropped.
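A minimal sketch of how such a tarball could be built for a reproduction, using only the standard library. The helper names build_model_tarball and list_tarball are hypothetical, and the file contents are placeholder bytes rather than real model weights.

```python
import io
import tarfile


def build_model_tarball(path, files):
    """Write a gzip-compressed tarball from a {member_name: bytes} mapping."""
    with tarfile.open(path, "w:gz") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return path


def list_tarball(path):
    """Return the sorted member names stored in the tarball."""
    with tarfile.open(path, "r:gz") as tar:
        return sorted(tar.getnames())
```

Calling build_model_tarball("model.tar.gz", {"model.pth": b"...", "another_model.pkl": b"..."}) and then list_tarball on the result gives a reference list of what the container is expected to receive.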
Expected behavior
The model package should include all of the extra files and dependencies, as described in the API documentation.
Screenshots or logs
I ran os.walk on the model_dir and found that artefacts I was expecting were missing. The log below shows the files in model_dir. I had an extra .pkl object in there, which was not included.
2020-09-18 23:50:40,059 [WARN ] W-9000-model_1-stderr org.pytorch.serve.wlm.WorkerLifeCycle - 2020-09-18 23:50:40.058 | INFO | inference:model_fn:59 - ['inference.py', 'handler_service.py', 'model.pth']
Similarly, the directory listing is missing the extra dependencies specified while creating the PyTorchModel object. They appear neither under model_dir/lib (see #1832, another bug) nor under model_dir (as specified in the API documentation).
2020-09-18 23:50:40,058 [WARN ] W-9000-model_1-stderr org.pytorch.serve.wlm.WorkerLifeCycle - 2020-09-18 23:50:40.058 | INFO | inference:model_fn:58 - ['__pycache__', 'MAR-INF']
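For anyone wanting to produce the same kind of listing from inside model_fn, here is a rough sketch of an os.walk-based snapshot. The function name snapshot_dir is hypothetical; the logs above came from the author's own logging call, which is not shown in this thread.

```python
import os


def snapshot_dir(root):
    """Return (dirs, files) as sorted paths relative to root."""
    dirs, files = [], []
    for current, subdirs, filenames in os.walk(root):
        rel = os.path.relpath(current, root)
        for d in subdirs:
            dirs.append(os.path.normpath(os.path.join(rel, d)))
        for f in filenames:
            files.append(os.path.normpath(os.path.join(rel, f)))
    return sorted(dirs), sorted(files)
```

Logging the two returned lists at the start of model_fn(model_dir) shows what the serving process can actually see, including anything nested in subdirectories.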
Additional context
The problem arises from this line in the TorchServe packaging process, which was introduced by https://github.com/aws/sagemaker-pytorch-inference-toolkit/pull/79/. As part of the repackaging, it drops every object except the inference script.
Clearly, that is a regression and unexpected.
Hi @setu4993,
Could you provide the Python SDK code and source_dir structure so we can root-cause the issue?
A short example to reproduce the issue (X and Y are other directories on the file system):
sagemaker_model = PyTorchModel(
    model_data="model.tar.gz",
    source_dir="/root/",
    entry_point="inference.py",
    dependencies=[
        "X",
        "Y",
    ],
    framework_version="1.6.0",
    py_version="py3",
)
After the repackaging step, the model.tar.gz created on S3 at the code_location contains:
├── code
│   ├── README.md
│   ├── inference.py
│   ├── lib
│   │   ├── X
│   │   │   └── __init__.py
│   │   ├── ...
│   │   └── Y
│   │       ├── README.md
│   │       ├── __init__.py
│   │       └── ...
│   └── requirements.txt
├── model.pth
└── another_model.pkl
But at runtime, the only files and directories in the model_dir on the container are inference.py, model.pth and handler_service.py (see the logs in the opening comment).
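The gap between the repackaged tarball and the runtime directory can be checked mechanically. The following is only a sketch: missing_at_runtime is a hypothetical helper, and it assumes the tarball from code_location has been downloaded to a local path and that the runtime model_dir is readable from the same process.

```python
import os
import tarfile


def missing_at_runtime(tarball_path, model_dir):
    """Return the tarball's file members that do not exist under model_dir."""
    with tarfile.open(tarball_path, "r:*") as tar:
        expected = [m.name for m in tar.getmembers() if m.isfile()]
    missing = []
    for name in expected:
        # Normalise names such as "./model.pth" before joining.
        rel = os.path.normpath(name)
        if not os.path.isfile(os.path.join(model_dir, rel)):
            missing.append(name)
    return sorted(missing)
```

The comparison is by relative path, so members stored under code/ in the tarball are reported as missing if the container places them elsewhere (the logs above show inference.py at the top level of model_dir).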
The problem here is multi-fold:
code_location and what is in the final inference service? This makes it even harder to understand (and diagnose) what is happening on the container, since the model tarball produced by SageMaker drifts away from the expected one (which was repackaged by the SDK!).
@ChuyangDeng: Any update on this?
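@setu4993, I have experienced similar issues. From the source code, I believe the following line might be the one causing what you (and I) are experiencing:
https://github.com/aws/sagemaker-pytorch-inference-toolkit/blob/9a6869ea3af1ebf9292da0a8d752b0c3389ecdec/src/sagemaker_pytorch_serving_container/torchserve.py#L125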
This makes inference.py the only (custom) py file in the model directory. I believe this container should support adding extra/custom artifacts (e.g., model companion objects, model architecture).
Furthermore, the SageMaker Python SDK example with PyTorch (with the code here) does not work with this serving container (because the model has a companion object).
Thanks @jonsnowseven. +1, I think I found the same line earlier :).
In my case, it all works with 1.5.0, so I'm sticking to it and updating to 1.6 with requirements.txt.
After almost 4 months from the last comment, this bug still persists....
Yes @ldong87. Any news regarding this?
Same issue here. Have been trying to work around this for 2 days now with no luck.
Wanted to share that @amaharek found a potential solution and posted it on the supplementary issue on the sagemaker-pytorch-inference-toolkit repo here: https://github.com/aws/sagemaker-pytorch-inference-toolkit/issues/85#issuecomment-780621393
@dectl One (ugly) workaround is to copy/load the missing model files from /opt/ml/model/, which is apparently the documented location of the decompressed model: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-code-load-artifacts
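A rough sketch of that workaround, assuming the missing files really are present under /opt/ml/model in the running container (reported above, not verified here). The helper name find_artifact and its parameters are hypothetical.

```python
import os


def find_artifact(name, model_dir, fallback_dir="/opt/ml/model"):
    """Return the first existing path for name, trying model_dir, then fallback_dir."""
    for base in (model_dir, fallback_dir):
        candidate = os.path.join(base, name)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(f"{name} not found in {model_dir} or {fallback_dir}")
```

Inside model_fn(model_dir), something like find_artifact("another_model.pkl", model_dir) would then pick up the companion file from whichever location has it, and fail loudly if neither does.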