I am trying to train a model using the undocumented local_code mode. If I don't specify source_dir, or if I set it to ".", the training procedure fails to mount the volumes correctly.
I get:
Cannot create container for service algo-1-JFP46: create .: volume name is too short, names should be at least two alphanumeric characters
I am reporting this even if local_code is still not documented, hoping it can be useful anyway.
from sagemaker.local import LocalSession

session = LocalSession()
session.config = {'local': {'local_code': True}}
est = MyEstimator(
    entry_point='code.py',
    train_instance_type='local',
    train_instance_count=1,
    role=role,
    sagemaker_session=session,
)
est.fit()
See the full traceback.
Here is the interesting part of the generated docker-compose.yaml:
networks:
  sagemaker-local:
    name: sagemaker-local
services:
    volumes:
    - /tmp/tmp_i5dhjtn/algo-1-JFP46/output/data:/opt/ml/output/data
    - /tmp/tmp_i5dhjtn/algo-1-JFP46/output:/opt/ml/output
    - /tmp/tmp_i5dhjtn/algo-1-JFP46/input:/opt/ml/input
    - /tmp/tmp_i5dhjtn/model:/opt/ml/model
    - :/opt/ml/code
    - /tmp/tmp_i5dhjtn/shared:/opt/ml/shared
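The `- :/opt/ml/code` entry has an empty host side, and Docker appears to reject such a source as a too-short volume name. Below is a minimal sketch of a pre-flight check for this kind of entry. It assumes POSIX host paths and the short `host:container` syntax shown above; `find_bad_volumes` is a hypothetical helper, not part of sagemaker or docker-compose.

```python
def find_bad_volumes(volume_specs):
    """Return the specs whose host side is not an absolute POSIX path."""
    bad = []
    for spec in volume_specs:
        # Split from the right so only the container path is peeled off.
        host, _, _container = spec.rpartition(":")
        # An empty or relative host (e.g. "" or ".") is treated by Docker
        # as a named volume rather than a bind mount.
        if not host.startswith("/"):
            bad.append(spec)
    return bad


volumes = [
    "/tmp/tmp_i5dhjtn/model:/opt/ml/model",
    ":/opt/ml/code",
    "/tmp/tmp_i5dhjtn/shared:/opt/ml/shared",
]
print(find_bad_volumes(volumes))
```

Running this against the volume list from the generated file flags only the `/opt/ml/code` entry.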
Hello @tyrion,
Thank you for bringing this to our attention.
I'll speak with the team about how to handle this situation and what fix is needed.
Thanks again!
@ChoiByungWook Any news regarding this issue?
Any update?
Any update? Same issue here
I'm using PyTorch in local mode and encountered the same issue!
Same issue when I'm trying to use local mode / local code currently.
Same issue here; any help would be appreciated. I am using the TensorFlow framework.
Is there any progress on this issue?
I had this problem. After I set source_dir to the absolute path of my code, the volume mounted properly in local mode.
I solved it using absolute path for the entry_point
entry_point=str(Path.cwd()/'train.py')
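The two workarounds reported here (an absolute source_dir or an absolute entry_point) both come down to resolving the path before handing it to the estimator. A minimal sketch, assuming the script lives in the current working directory; `to_absolute` is a hypothetical helper and `train.py` is only an example file name:

```python
from pathlib import Path


def to_absolute(path):
    """Resolve a possibly relative path against the current directory."""
    return str(Path(path).resolve())


# Either value can then be passed to the estimator in place of the
# relative form.
entry_point = to_absolute("train.py")
source_dir = to_absolute(".")
print(entry_point)
print(source_dir)
```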
I ran into this problem on Windows. It appears sagemaker doesn't parse the path correctly when creating the Docker volume definitions, which leads to an incorrect mapping for the /opt/ml/code directory. Here's a drop-in solution:
from pathlib import Path
import platform

import sagemaker.local.image


class VolumeOverride(sagemaker.local.image._Volume):
    '''
    Windows paths aren't handled correctly by sagemaker when creating
    the docker-compose volume definitions. As one example, see line 370 of
    sagemaker.local.image:
    | volumes.append(_Volume(parsed_uri.path, "/opt/ml/code"))
    The parsed_uri cuts off the path drive label (stored in parsed_uri.netloc),
    leading to an empty mount in the docker image.
    Here, we override the appropriate sagemaker class in order to
    allow correct source_dir mounting. Unfortunately, the correct
    drive information is not passed to the _Volume class, so we assume
    it is the same as the root drive for the current working directory.
    '''

    def __init__(self, *args, **kwargs):
        super(VolumeOverride, self).__init__(*args, **kwargs)
        # Only patch the code mount, and only on Windows.
        if platform.system() == 'Windows' and str(self.container_dir) == '/opt/ml/code':
            self.map = f'{Path.cwd().drive}{self.host_dir}:{self.container_dir}'


# Replace the private class so sagemaker uses the override from now on.
sagemaker.local.image._Volume = VolumeOverride
To use it, just put it in a file called sagemaker_windows_fix.py, then import sagemaker_windows_fix in your main file. Make sure you pass the full path of your code directory to source_dir.
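The drive-label loss described in the docstring can be reproduced with the standard library alone. This is a minimal sketch that assumes the code directory reaches the volume logic as a `file://` URI built by simple prefixing; the path `C:/Users/me/project` is made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical Windows code directory, written as a file:// URI.
parsed = urlparse("file://C:/Users/me/project")

# The drive label lands in netloc, so path alone no longer points
# at the right location on a Windows host.
print(parsed.netloc)
print(parsed.path)

# Re-attaching the drive label restores a usable host path.
host_dir = f"{parsed.netloc}{parsed.path}"
print(f"{host_dir}:/opt/ml/code")
```

The override above does the equivalent repair, except that it takes the drive from the current working directory because `netloc` is not available inside `_Volume`.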
I am getting this same error.
I solved it using absolute path for the entry_point
entry_point=str(Path.cwd()/'train.py')
This worked for me. Local mode needs the absolute path to the file, not the relative one.