Sagemaker-python-sdk: Error when deploying custom model via `MXNetModel.deploy`

Created on 30 Aug 2019 · 3 comments · Source: aws/sagemaker-python-sdk

System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): MXNet / MXNetModel
  • Framework Version: 1.4.1
  • Python Version: 3.6.7
  • CPU or GPU: N/A
  • Python SDK Version: 1.38.4
  • Are you using a custom image: No

Describe the problem

We get the following error when trying to deploy a custom MXNet model via `MXNetModel.deploy`:

```
Python 3.6.7 | packaged by conda-forge | (default, Jul  2 2019, 02:07:37)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from sagemaker.mxnet.model import MXNetModel
>>> mxnet_model = MXNetModel(model_data=s3_data_bucket,
...                          role=role_arn,
...                          entry_point='inference.py',
...                          dependencies=['path/to/package'],
...                          framework_version='1.4.1',
...                          py_version='py3')
>>>
>>> mxnet_model.deploy(instance_type='ml.c5.xlarge', initial_instance_count=1)
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda3/envs/xx/lib/python3.6/site-packages/sagemaker/model.py", line 426, in deploy
    self._create_sagemaker_model(instance_type, accelerator_type, tags)
  File "/anaconda3/envs/xx/lib/python3.6/site-packages/sagemaker/model.py", line 165, in _create_sagemaker_model
    container_def = self.prepare_container_def(instance_type, accelerator_type=accelerator_type)
  File "/anaconda3/envs/xx/lib/python3.6/site-packages/sagemaker/mxnet/model.py", line 148, in prepare_container_def
    self._upload_code(deploy_key_prefix, is_mms_version)
  File "/anaconda3/envs/xx/lib/python3.6/site-packages/sagemaker/model.py", line 788, in _upload_code
    sagemaker_session=self.sagemaker_session,
  File "/anaconda3/envs/xx/lib/python3.6/site-packages/sagemaker/utils.py", line 413, in repack_model
    model_dir, inference_script, source_directory, dependencies, sagemaker_session, tmp
  File "/anaconda3/envs/xx/lib/python3.6/site-packages/sagemaker/utils.py", line 472, in _create_or_update_code_dir
    shutil.copytree(dependency, code_dir, os.path.basename(dependency))
  File "/anaconda3/envs/xx/lib/python3.6/shutil.py", line 315, in copytree
    os.makedirs(dst)
  File "/anaconda3/envs/xx/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/var/folders/3z/kqrl01qd2vv6248zcs56q0wh0000gn/T/tmpqbeoxi0t/model/code'
```

We strongly suspect the problem to be related to [line 472](https://github.com/aws/sagemaker-python-sdk/blob/9765de68ad8b776740d800148c861ca0e4794716/src/sagemaker/utils.py#L472) in `sagemaker/utils.py`:

```python
shutil.copytree(dependency, code_dir)
```

The problem is that `code_dir` already exists at that point, so `shutil.copytree` raises an error: it requires that the destination directory not already exist (cf. the `shutil.copytree` documentation).

This problem only occurs when the `MXNetModel` constructor is given the `framework_version` parameter, which triggers a call to `utils.repack_model` from `FrameworkModel._upload_code`.

We believe the solution is to change that line to the following:

```python
shutil.copytree(dependency, os.path.join(code_dir, os.path.basename(dependency)))
```

so that the dependency folder is copied into `code_dir` instead of trying to replace it.
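
The behaviour described above can be checked without AWS credentials or the SDK. The following is a minimal sketch using only the standard library. The helper `copy_dependency` and the temporary directory layout are hypothetical stand-ins for what the SDK does internally, not its actual code. It shows that copying onto an existing `code_dir` raises `FileExistsError`, while copying into a subdirectory named after the dependency succeeds.

```python
import os
import shutil
import tempfile


def copy_dependency(dependency, code_dir, nest=True):
    """Copy a dependency directory to (nest=False) or into (nest=True) code_dir.

    Hypothetical helper for illustration only:
    nest=False mirrors the call suspected of failing,
    nest=True mirrors the change proposed in this report.
    """
    if nest:
        dst = os.path.join(code_dir, os.path.basename(dependency))
    else:
        dst = code_dir
    return shutil.copytree(dependency, dst)


with tempfile.TemporaryDirectory() as tmp:
    # Assumed layout, loosely modelled on the paths in the traceback.
    dependency = os.path.join(tmp, "path", "to", "package")
    os.makedirs(dependency)
    open(os.path.join(dependency, "utils.py"), "w").close()

    code_dir = os.path.join(tmp, "model", "code")
    os.makedirs(code_dir)  # code_dir already exists before the copy

    # Copying directly onto the existing directory fails.
    try:
        copy_dependency(dependency, code_dir, nest=False)
        original_error = None
    except FileExistsError as exc:
        original_error = exc

    # Copying into code_dir/<dependency name> works.
    copied_to = copy_dependency(dependency, code_dir, nest=True)
    copied_files = sorted(os.listdir(copied_to))

print(type(original_error).__name__)
print(os.path.basename(copied_to), copied_files)
```

Whether the released fix uses exactly this destination path is not stated in this thread; the sketch only demonstrates why the nested destination avoids the error.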

Minimal repro / logs

To reproduce:

1. In a shell, create an entry point script and a dependency package:

   ```bash
   touch inference.py
   mkdir -p path/to/package
   touch path/to/package/utils.py
   ```

2. On a Python console, run:

   ```python
   from sagemaker.mxnet.model import MXNetModel

   s3_data_bucket = 'bucket_with_model_artifacts'
   role_arn = 'role_arn'

   mxnet_model = MXNetModel(model_data=s3_data_bucket,
                            role=role_arn,
                            entry_point='inference.py',
                            dependencies=['path/to/package'],
                            framework_version='1.4.1',
                            py_version='py3')

   mxnet_model.deploy(instance_type='ml.c5.xlarge', initial_instance_count=1)
   ```

Labels: pending release, bug

All 3 comments

+1. It seems using framework_version='1.3.0' is OK, and I only get the error when specifying 1.4.1.

Thank you for the super-detailed bug report! I've submitted a PR here: https://github.com/aws/sagemaker-python-sdk/pull/1021

v1.39.1 has been released with this change. Hopefully this solves your issue!
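
To tell whether an environment already includes that release, the installed SDK version can be compared against 1.39.1. The helper below is a hypothetical sketch, not part of the SDK. It assumes plain numeric version strings such as `1.38.4` and would need a proper version parser to handle pre-release suffixes.

```python
def has_copytree_fix(sdk_version, fixed_in=(1, 39, 1)):
    """Return True if sdk_version is at least the release named in this thread.

    Hypothetical helper: assumes a purely numeric 'major.minor.patch' string.
    """
    parts = tuple(int(part) for part in sdk_version.split(".")[:3])
    return parts >= fixed_in


# The version from the report predates the fix; the announced release has it.
print(has_copytree_fix("1.38.4"))  # False
print(has_copytree_fix("1.39.1"))  # True
```

In practice the version string would come from the installed package, for example `sagemaker.__version__`, assuming the installed SDK exposes that attribute.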
