Sagemaker-python-sdk: ValueError: Error for Training job <training job name>: Failed Reason: AlgorithmError: ExecuteUserScriptError: Command "/usr/bin/python3 -m train

Created on 2 Aug 2019 · 3 Comments · Source: aws/sagemaker-python-sdk

Please fill out the form below.

System Information

Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): SKLearn
Framework Version: N/A
Python Version: 3.5 (?)
CPU or GPU: CPU
Python SDK Version: Not sure
Are you using a custom image: Yes, I think so

Describe the problem

I am trying to train an SKLearn estimator on training data that I have stored on S3, with train.py as my entry point. I have tried with and without the hyperparameter arguments in the estimator definition and in the train.py file. I have successfully trained similar estimators using PyTorch, but am not having success with SKLearn. Are there compatibility issues? Any suggestions? Also, what is the best way to go about debugging a script file? All of the code is below.

Minimal repro / logs

Please provide any logs and a bare minimum reproducible test case, as this will be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

minimal repro: https://github.com/AmiriMc/SageMaker_SKLearn_estimator_errors

Exact command to reproduce:
Estimator code that I run from a Jupyter notebook (this runs fine):

from sagemaker.sklearn.estimator import SKLearn
from sklearn.base import BaseEstimator
from sagemaker import get_execution_role

role = get_execution_role()
output_path = 's3://{}/{}'.format(bucket, prefix)

scikit_estimator = SKLearn(role=role,
                           entry_point='train.py',
                           source_dir='source_sklearn',
                           train_instance_count=1,
                           train_instance_type='ml.c4.xlarge',
                           output_path=output_path,
                           sagemaker_session=sagemaker_session,
                           hyperparameters={'priors': None, 'var_smoothing': 1e-9})

My train.py file (entrypoint)

from __future__ import print_function

import argparse
import os
import pandas as pd

from sklearn.externals import joblib

# TODO: Import any additional libraries you need to define a model
from sklearn.base import BaseEstimator
from sklearn.naive_bayes import GaussianNB

# Provided model load function
def model_fn(model_dir):
    """Load model from the model_dir. This is the same model that is saved
    in the main if statement.
    """
    print("Loading model.")
    # load using joblib
    model = joblib.load(os.path.join(model_dir, "model.joblib"))
    print("Done loading model.")
    return model

# TODO: Complete the main code
if __name__ == '__main__':
    # All of the model parameters and training parameters are sent as arguments
    # when this script is executed, during a training job
    # Here we set up an argument parser to easily access the parameters
    parser = argparse.ArgumentParser()

    # SageMaker parameters, like the directories for training data and saving models; set automatically
    # Do not need to change
    parser.add_argument('--output-data-dir', type=str, default=os.environ)
    parser.add_argument('--model-dir', type=str, default=os.environ)
    parser.add_argument('--data-dir', type=str, default=os.environ)

    ## TODO: Add any additional arguments that you will need to pass into your model
    parser.add_argument('--priors', type=str, default=None, metavar='Priors', help='prior probabilities of the classes (default: None)')
    parser.add_argument('--var_smoothing', type=float, default=1e-09, metavar='Var', help='portion of largest variance (default: 1e-9)')

    # args holds all passed-in arguments
    args = parser.parse_args()

    # Read in csv training file
    training_dir = args.data_dir
    train_data = pd.read_csv(os.path.join(training_dir, "train.csv"), header=None, names=None)

    # Labels are in the first column
    train_y = train_data.iloc[:, 0]
    train_x = train_data.iloc[:, 1:]

    ## --- Your code here --- ##

    ## TODO: Define a model
    model = GaussianNB(priors=args.priors, var_smoothing=args.var_smoothing)

    ## TODO: Train the model
    model.fit(train_x, train_y)

    ## --- End of your code --- ##

    # Save the trained model
    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))

Here I attempt to train scikit_estimator:
scikit_estimator.fit(train_data_path)

path to train.csv data: s3://sagemaker-us-east-2-709203276624/sagemaker/plagiarism_data_project/train.csv

2019-08-01 22:02:57 Starting - Starting the training job...
2019-08-01 22:02:59 Starting - Launching requested ML instances......
2019-08-01 22:04:00 Starting - Preparing the instances for training...
2019-08-01 22:04:54 Downloading - Downloading input data
2019-08-01 22:04:54 Training - Training image download completed. Training in progress..
2019-08-01 22:04:54,697 sagemaker-containers INFO Imported framework sagemaker_sklearn_container.training
2019-08-01 22:04:54,699 sagemaker-containers INFO No GPUs detected (normal if no gpus installed)
2019-08-01 22:04:54,711 sagemaker_sklearn_container.training INFO Invoking user training script.
2019-08-01 22:04:54,968 sagemaker-containers INFO Module train does not provide a setup.py. 
Generating setup.py
2019-08-01 22:04:54,968 sagemaker-containers INFO Generating setup.cfg
2019-08-01 22:04:54,969 sagemaker-containers INFO Generating MANIFEST.in
2019-08-01 22:04:54,969 sagemaker-containers INFO Installing module with the following command:
/usr/bin/python3 -m pip install -U . 
Processing /opt/ml/code
Building wheels for collected packages: train
Building wheel for train (setup.py): started
Building wheel for train (setup.py): finished with status 'done'
Stored in directory: /tmp/pip-ephem-wheel-cache-1dq0l00y/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
Successfully built train
Installing collected packages: train
Successfully installed train-1.0.0
WARNING: You are using pip version 19.1.1, however version 19.2.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
2019-08-01 22:04:56,184 sagemaker-containers INFO No GPUs detected (normal if no gpus installed)
2019-08-01 22:04:56,196 sagemaker-containers INFO Invoking user script

Training Env:

{
"hosts": [
"algo-1"
],
"hyperparameters": {
"priors": null,
"var_smoothing": 1e-09
},
"user_entry_point": "train.py",
"output_intermediate_dir": "/opt/ml/output/intermediate",
"model_dir": "/opt/ml/model",
"is_master": true,
"master_hostname": "algo-1",
"module_dir": "s3://sagemaker-us-east-2-709203276624/sagemaker-scikit-learn-2019-08-01-22-02-57-342/source/sourcedir.tar.gz",
"module_name": "train",
"num_cpus": 4,
"output_dir": "/opt/ml/output",
"input_dir": "/opt/ml/input",
"log_level": 20,
"additional_framework_parameters": {},
"framework_module": "sagemaker_sklearn_container.training:main",
"input_data_config": {
"training": {
"TrainingInputMode": "File",
"RecordWrapperType": "None",
"S3DistributionType": "FullyReplicated"
}
},
"current_host": "algo-1",
"job_name": "sagemaker-scikit-learn-2019-08-01-22-02-57-342",
"output_data_dir": "/opt/ml/output/data",
"input_config_dir": "/opt/ml/input/config",
"network_interface_name": "eth0",
"resource_config": {
"hosts": [
"algo-1"
],
"current_host": "algo-1",
"network_interface_name": "eth0"
},
"num_gpus": 0,
"channel_input_dirs": {
"training": "/opt/ml/input/data/training"
}
}

Environment variables:

SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_MODULE_NAME=train
SM_INPUT_DIR=/opt/ml/input
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_sklearn_container.training:main","hosts":["algo-1"],"hyperparameters":{"priors":null,"var_smoothing":1e-09},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"sagemaker-scikit-learn-2019-08-01-22-02-57-342","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-2-709203276624/sagemaker-scikit-learn-2019-08-01-22-02-57-342/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_USER_ARGS=
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_NETWORK_INTERFACE_NAME=eth0
SM_FRAMEWORK_PARAMS={}
SM_CHANNELS=["training"]
SM_LOG_LEVEL=20
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_CURRENT_HOST=algo-1
SM_USER_ENTRY_POINT=train.py
SM_OUTPUT_DIR=/opt/ml/output
SM_HPS={"priors":null,"var_smoothing":1e-09}
SM_NUM_GPUS=0
SM_HOSTS=["algo-1"]
PYTHONPATH=/usr/local/bin:/usr/lib/python35.zip:/usr/lib/python3.5:/usr/lib/python3.5/plat-x86_64-linux-gnu:/usr/lib/python3.5/lib-dynload:/usr/local/lib/python3.5/dist-packages:/usr/lib/python3/dist-packages
SM_MODULE_DIR=s3://sagemaker-us-east-2-709203276624/sagemaker-scikit-learn-2019-08-01-22-02-57-342/source/sourcedir.tar.gz
SM_MODEL_DIR=/opt/ml/model
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_HP_PRIORS=
SM_FRAMEWORK_MODULE=sagemaker_sklearn_container.training:main
SM_NUM_CPUS=4
SM_HP_VAR_SMOOTHING=1e-09

Invoking script with the following command:

/usr/bin/python3 -m train --priors --var_smoothing 1e-09
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/ml/code/train.py", line 41, in <module>
parser.add_argument('--data-dir', type=str, default=os.environ)
File "/usr/lib/python3.5/os.py", line 725, in __getitem__
raise KeyError(key) from None
KeyError: 'SM_CHANNEL_TRAIN'
2019-08-01 22:04:57,405 sagemaker-containers ERROR ExecuteUserScriptError:
Command "/usr/bin/python3 -m train --priors --var_smoothing 1e-09"

2019-08-01 22:05:05 Uploading - Uploading generated training model
2019-08-01 22:05:05 Failed - Training job failed
ValueError Traceback (most recent call last)
<timed exec> in <module>()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name)
267 The API calls the Amazon SageMaker CreateTrainingJob API to start
268 model training. The API uses configuration you provided to create the
--> 269 estimator and the specified input training data to send the
270 CreatingTrainingJob request to Amazon SageMaker.
271 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
719 security groups, or else validate and return an optional override value.
720 
--> 721 Args:
722 vpc_config_override:
723 """

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll)
1394 last_describe_job_call = time.time()
1395 last_description = description
-> 1396 while True:
1397 if len(stream_names) < instance_count:
1398 # Log streams are created whenever a container starts writing to stdout/err, so

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
1026 poll (int): Polling interval in seconds (default: 5).
1027 
-> 1028 Returns:
1029 (dict): Return value from the ``DescribeHyperParameterTuningJob`` API.
1030 

ValueError: Error for Training job sagemaker-scikit-learn-2019-08-01-22-02-57-342: Failed Reason: AlgorithmError: ExecuteUserScriptError:
Command "/usr/bin/python3 -m train --priors --var_smoothing 1e-09"

see error directly above

Source

AmiriMc

Most helpful comment

@icywang86rui -
Thank you for the help! Yes, that is exactly what was in my train.py script. However, I was able to get it working by adding parser.add_argument('--train', type=str, default=os.environ) to the script. Then, where I fit the estimator in the notebook, I had to pass in a dictionary, scikit_estimator.fit({'train': train_data_path}), instead of just scikit_estimator.fit(train_data_path).
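For anyone landing here later, a minimal sketch of that fix follows. It assumes the bracketed key that was dropped from the snippet above is 'SM_CHANNEL_TRAIN' (the key named in the traceback), and that the container exports one SM_CHANNEL_<NAME> variable per key passed to fit(). The helper name build_parser and the fake_env dictionary are made up for illustration; they are not part of the SageMaker SDK.

```python
import argparse
import os


def build_parser(env=None):
    """Build the train.py argument parser (hypothetical helper).

    Defaults come from env.get(...) rather than env[...], so a missing
    variable yields None instead of raising KeyError while the parser
    is still being constructed.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument('--output-data-dir', type=str,
                        default=env.get('SM_OUTPUT_DATA_DIR'))
    parser.add_argument('--model-dir', type=str,
                        default=env.get('SM_MODEL_DIR'))
    # Assumed: a channel named 'train' is exposed as SM_CHANNEL_TRAIN.
    parser.add_argument('--train', type=str,
                        default=env.get('SM_CHANNEL_TRAIN'))
    return parser


# Stand-in for the container environment when the notebook calls
# scikit_estimator.fit({'train': train_data_path}).
fake_env = {
    'SM_OUTPUT_DATA_DIR': '/opt/ml/output/data',
    'SM_MODEL_DIR': '/opt/ml/model',
    'SM_CHANNEL_TRAIN': '/opt/ml/input/data/train',
}

args = build_parser(fake_env).parse_args([])
if args.train is None:
    raise SystemExit("No 'train' channel found; check the keys passed to fit().")
print(args.train)
```

With this shape, a mismatch between the channel name and the environment variable shows up as a readable message rather than a KeyError raised from add_argument.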

I tried to close this out, but could not figure out how to get back to this post, until I got notification of your post. :) I'm still very new to Python, AWS, and ML.

AmiriMc on 2 Aug 2019

👍4

All 3 comments

@AmiriMc -
Is the train.py script you provided there exactly what you used for the training job? This part here doesn't make sense to me:

parser.add_argument('--output-data-dir', type=str, default=os.environ)
parser.add_argument('--model-dir', type=str, default=os.environ)
parser.add_argument('--data-dir', type=str, default=os.environ)

The default values should be read from specific env vars. Are you doing something like this:

parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN'])

Based on the stack trace, this key (env var) does not exist. The actual env var name is SM_CHANNEL_TRAINING:

SM_CHANNEL_TRAINING=/opt/ml/input/data/training

I think if you read from this env var the script will execute correctly.
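To make the naming mismatch concrete, here is a small sketch. It assumes the container derives each variable name by upper-casing the channel name and prefixing it with SM_CHANNEL_, which matches the log in this issue but is not quoted from the container source. The function names channel_env_var and available_channels are hypothetical helpers, not SDK calls.

```python
def channel_env_var(channel_name):
    """Env var name assumed to be used for a given input channel."""
    return "SM_CHANNEL_" + channel_name.upper()


def available_channels(env):
    """Return {channel name: local directory} for every SM_CHANNEL_* entry."""
    prefix = "SM_CHANNEL_"
    return {
        key[len(prefix):].lower(): value
        for key, value in env.items()
        if key.startswith(prefix)
    }


# Subset of the environment printed in the training log of this issue.
logged_env = {
    "SM_CHANNEL_TRAINING": "/opt/ml/input/data/training",
    "SM_MODEL_DIR": "/opt/ml/model",
}

wanted = channel_env_var("train")
print(wanted)                          # name the script looked up
print(wanted in logged_env)            # whether the job actually set it
print(available_channels(logged_env))  # channels the job did provide
```

Printing the available channels at the top of a training script is one cheap way to see which names the job really provides before indexing os.environ.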

icywang86rui on 2 Aug 2019

👍1

I tried to close this out, but could not figure out how to get back to this post, until I got notification of your post. :) I'm still very new to Python, AWS, and ML.

AmiriMc on 2 Aug 2019

👍4

Closing this out.

AmiriMc on 2 Aug 2019
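
On the debugging question from the original post, one possible approach (a sketch, not an official SageMaker workflow) is to run the entry point locally with the SM_* variables set by hand, so a failure like the KeyError above appears in seconds instead of after a training job launches. The helper run_entry_point_locally and the stub script below are hypothetical; substitute the real train.py path and directories.

```python
import os
import subprocess
import sys
import tempfile


def run_entry_point_locally(script_path, sm_env, script_args=()):
    """Run a script in a subprocess with only the given SM_* variables set."""
    # Start from the host environment, minus any SM_* values already present.
    env = {k: v for k, v in os.environ.items() if not k.startswith("SM_")}
    env.update(sm_env)
    result = subprocess.run(
        [sys.executable, script_path] + list(script_args),
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.returncode, result.stdout, result.stderr


with tempfile.TemporaryDirectory() as workdir:
    # Stub standing in for train.py: it reads the same key as the failing script.
    stub = os.path.join(workdir, "train_stub.py")
    with open(stub, "w") as handle:
        handle.write("import os\nprint(os.environ['SM_CHANNEL_TRAIN'])\n")

    # Mimic fit(train_data_path): only SM_CHANNEL_TRAINING is set.
    code_fail, out_fail, err_fail = run_entry_point_locally(
        stub, {"SM_CHANNEL_TRAINING": workdir})

    # Mimic fit({'train': train_data_path}): SM_CHANNEL_TRAIN is set.
    code_ok, out_ok, err_ok = run_entry_point_locally(
        stub, {"SM_CHANNEL_TRAIN": workdir})

print("string input exit code:", code_fail)
print("dict input exit code:", code_ok)
```

The first run should fail with the same KeyError as the training job, and the second should succeed, which mirrors the difference between the two fit() calls discussed in the comments.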
