Sagemaker-python-sdk: Training a model in `local_code` mode does not work if `source_dir="."`

Created on 12 Dec 2018  ·  12 Comments  ·  Source: aws/sagemaker-python-sdk

System Information

  • Python Version: 3.6
  • Python SDK Version: master
  • Are you using a custom image: yes

Describe the problem

I am trying to train a model using the undocumented local_code mode. If I don't specify source_dir, or if I set it to ".", the training procedure fails to mount the volumes correctly.

I get:

Cannot create container for service algo-1-JFP46: create .: volume name is too short, names should be at least two alphanumeric characters

I am reporting this even though local_code is not documented yet, hoping it can be useful anyway.

Minimal repro / logs


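from sagemaker.local import LocalSession

# Enable the undocumented local_code mode on a local session.
session = LocalSession()
session.config = {'local': {'local_code': True}}

# MyEstimator (the reporter's estimator class) and role are defined elsewhere.
# source_dir is deliberately left unset, which triggers the failure.
est = MyEstimator(
    entry_point='code.py',
    train_instance_type='local',
    train_instance_count=1,
    role=role,
    sagemaker_session=session,
)

est.fit()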

See the full traceback.

Here is the interesting part of the generated docker-compose.yaml:

networks:
  sagemaker-local:
    name: sagemaker-local
services:
    volumes:
    - /tmp/tmp_i5dhjtn/algo-1-JFP46/output/data:/opt/ml/output/data
    - /tmp/tmp_i5dhjtn/algo-1-JFP46/output:/opt/ml/output
    - /tmp/tmp_i5dhjtn/algo-1-JFP46/input:/opt/ml/input
    - /tmp/tmp_i5dhjtn/model:/opt/ml/model
    - :/opt/ml/code
    - /tmp/tmp_i5dhjtn/shared:/opt/ml/shared

Labels: bug, documentation, feature request

All 12 comments

Hello @tyrion,

Thank you for bringing this to our attention.

I'll speak with the team about how to handle this situation and what fix is needed.

Thanks again!

@ChoiByungWook Any news regarding this issue?

Any update?

Any update? Same issue here

I encountered the same issue using PyTorch local mode!

Same issue when I'm trying to use local mode / local code currently.

Same issue here, and I would appreciate any help. I am running the TensorFlow framework.

Is there any progress on this issue?

I had this problem. After I set source_dir to the absolute path of my code directory, the volume mounted properly in local mode.

I solved it using absolute path for the entry_point

entry_point=str(Path.cwd()/'train.py')
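The two workarounds reported above (an absolute source_dir, or an absolute entry_point when no source_dir is given) can be combined in a small helper. This is only a sketch: absolute_code_paths is a hypothetical name, not part of the SageMaker SDK, and it assumes the usual convention that entry_point is interpreted relative to source_dir when source_dir is set.

```python
from pathlib import Path


def absolute_code_paths(entry_point, source_dir=None):
    """Return (entry_point, source_dir) with relative locations made absolute.

    Hypothetical helper for illustration; not part of the SageMaker SDK.
    """
    if source_dir is None:
        # No source_dir: make the script path itself absolute,
        # equivalent to str(Path.cwd() / 'train.py').
        return str(Path(entry_point).resolve()), None
    # With a source_dir: leave entry_point relative to it (assumed convention)
    # and resolve only the directory, so "." becomes a full path.
    return entry_point, str(Path(source_dir).resolve())
```

The returned values would then be passed as entry_point= and source_dir= when constructing the estimator, instead of the bare relative strings.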

Ran into this problem on Windows. It appears sagemaker doesn't parse the path correctly when creating the docker volume definitions, which leads to an incorrect mapping for the /opt/ml/code directory. Here's a drop-in solution:

import platform
from pathlib import Path

import sagemaker
import sagemaker.local.image  # import the submodule explicitly before patching it


class VolumeOverride(sagemaker.local.image._Volume):
    '''
    Windows paths aren't handled correctly by sagemaker when creating
    the docker-compose volume definitions. As one example, see line 370 of
    sagemaker.local.image:
        | volumes.append(_Volume(parsed_uri.path, "/opt/ml/code"))
    The parsed_uri cuts off the path drive label (stored in parsed_uri.netloc),
    leading to an empty mount in the docker image.

    Here, we override the appropriate sagemaker class in order to
    allow correct source_dir mounting. Unfortunately, the correct
    drive information is not passed to the _Volume class, so we assume
    it is the same as the root drive for the current working directory.
    '''

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # Only rewrite the code mount, and only on Windows.
        if platform.system() == 'Windows' and str(self.container_dir) == '/opt/ml/code':
            # Prepend the drive of the current working directory to the host path.
            self.map = f'{Path.cwd().drive}{self.host_dir}:{self.container_dir}'


# Monkey-patch the SDK so local mode builds volumes with the override.
sagemaker.local.image._Volume = VolumeOverride

To use it, just put it in a file called sagemaker_windows_fix.py, then import sagemaker_windows_fix in your main file. Make sure you pass the full path of your code directory to source_dir.
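To see why the host side of the mount ends up empty or loses its drive letter, it may help to look at how urlparse splits such URIs. The sketch below assumes the SDK builds the URI by prefixing the source directory with file://, which is consistent with the line quoted in the docstring but is not confirmed here; the example paths are made up.

```python
from urllib.parse import urlparse

# Assumed URI shapes: "file://" + source_dir. The paths are illustrative only.
examples = [
    "file://.",                    # source_dir="."
    "file:///home/me/project",     # absolute POSIX path
    "file://C:/Users/me/project",  # absolute Windows path
]

for uri in examples:
    parsed = urlparse(uri)
    # Whatever follows "//" up to the next "/" is treated as the network
    # location, so "." and "C:" end up in netloc rather than in path.
    print(f"{uri!r}: netloc={parsed.netloc!r} path={parsed.path!r}")
```

For "file://." the path component is an empty string, which matches the `- :/opt/ml/code` entry in the generated docker-compose.yaml. For the Windows URI the drive label lands in netloc, which is the part the override above restores from the current working directory.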

I am getting this same error.

I solved it using absolute path for the entry_point

entry_point=str(Path.cwd()/'train.py')

This worked for me. It needs the absolute and not the relative path to the file for local mode to work.
