Hi,
I am developing my own algorithm following the decision trees example at awslabs - scikit_bring_your_own. I copied the Dockerfile, sklearn-task-def.json and the decision_trees folder containing train, serve, predictor.py, wsgi.py and nginx.conf to my Jupyter folder. I set the following permissions: AmazonEC2ContainerServiceFullAccess, AmazonEC2ContainerRegistryFullAccess and AmazonSageMakerFullAccess. Instead of running the code from the command line, I'm using Jupyter.
Then, I run:
! docker build -t decision-trees .
! aws ecr create-repository --repository-name decision-trees
! docker tag decision-trees:latest aws_account_id.dkr.ecr.us-east-1.amazonaws.com/decision-trees
! aws ecr get-login --no-include-email
! docker login -u abc -p abc12345 http://abc123
! docker push aws_account_id.dkr.ecr.us-east-1.amazonaws.com/decision-trees:latest
! aws ecs register-task-definition --cli-input-json file://decision-trees-task-def.json
Both the repository and the task definition are created successfully and show as ACTIVE. After that, I set the permissions on the ECS repository to give my own account full access.
In a Python 2 environment, I run the following code:
import pandas as pd
import numpy as np
import sagemaker
from sagemaker.predictor import csv_serializer
import boto3
import re
import os
from sagemaker import get_execution_role
role = get_execution_role()
df=pd.read_csv('s3://sage-maker/DadosTeseLogit.csv',sep=',',header=0)
sel=np.where(df.corr()['selected']>.5)[0][0:-1]
df=df.iloc[:,np.concatenate([[30],sel])]
bucket = 'sage-maker-4'
prefix = 'sagemaker/xgboost-churn'
df.to_csv('train.csv',header=False, index=False)
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
containers = {'us-east-1': '1234567.dkr.ecr.us-east-1.amazonaws.com/decision-trees:latest'}
sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.fit({'train': s3_input_train})
Initially I got a regex validation error, because the underscore in '1234567.dkr.ecr.us-east-1.amazonaws.com/decision_trees:latest' is not allowed, so I changed the name to 'decision-trees'. But now I'm getting the following error:
`..............
ValueError Traceback (most recent call last)
----> 1 xgb.fit({'train': s3_input_train})
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name)
177 self.latest_training_job = _TrainingJob.start_new(self, inputs)
178 if wait:
--> 179 self.latest_training_job.wait(logs=logs)
180
181 @classmethod
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
366 def wait(self, logs=True):
367 if logs:
--> 368 self.sagemaker_session.logs_for_job(self.job_name, wait=True)
369 else:
370 self.sagemaker_session.wait_for_job(self.job_name)
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll)
784
785 if wait:
--> 786 self._check_job_status(job_name, description, 'TrainingJobStatus')
787 if dot:
788 print()
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
524 if status != 'Completed' and status != 'Stopped':
525 reason = desc.get('FailureReason', '(No reason provided)')
--> 526 raise ValueError('Error training {}: {} Reason: {}'.format(job, status, reason))
527
528 def wait_for_endpoint(self, endpoint, poll=5):
ValueError: Error training decision-trees-2018-06-10-23-27-03-210: Failed Reason: AlgorithmError: Exit Code: 1`
This is strange, because the Dockerfile explicitly states:
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"
COPY decision_trees /opt/program
COPY decision_trees/train /opt/program
COPY decision_trees/serve /opt/program
COPY decision_trees/predictor.py /opt/program
COPY decision_trees/nginx.conf /opt/program
COPY decision_trees/wsgi.py /opt/program
WORKDIR /opt/program
Any ideas on how to overcome this issue?
Thanks in advance.
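The `Exit Code: 1` reason in the traceback above says little on its own. SageMaker training jobs write container output to the CloudWatch Logs group `/aws/sagemaker/TrainingJobs`, so one way to look for more detail is to read the job's log streams. The following is only a minimal sketch: the helper name `training_log_lines` is made up for illustration, and it assumes the notebook role is allowed to read CloudWatch Logs and that the container got far enough to log anything at all.

```python
def training_log_lines(logs_client, job_name, limit=50):
    """Return up to `limit` recent log messages per stream for a training job.

    `logs_client` is expected to behave like boto3.client("logs").
    """
    group = "/aws/sagemaker/TrainingJobs"
    # Streams for a job are named "<job name>/<instance id>-...".
    streams = logs_client.describe_log_streams(
        logGroupName=group, logStreamNamePrefix=job_name + "/"
    )["logStreams"]
    lines = []
    for stream in streams:
        events = logs_client.get_log_events(
            logGroupName=group,
            logStreamName=stream["logStreamName"],
            limit=limit,
            startFromHead=False,  # read from the end of the stream
        )["events"]
        lines.extend(event["message"] for event in events)
    return lines
```

With credentials configured, it could be called as `training_log_lines(boto3.client("logs"), "decision-trees-2018-06-10-23-27-03-210")` and the returned lines printed in the notebook.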
I ran into the same problem when adapting from the scikit_bring_your_own example.
Found this issue because I googled that error message!
In my case, this was because I hadn't updated the build script to make the new train & serve scripts executable.
Specifically, in the notebook describing the process, there's a big %%sh cell where the image is built (among other things). The lines:
chmod +x decision_trees/train
chmod +x decision_trees/serve
make sure that docker can actually execute these scripts.
In my example, I had changed the names of everything (including those directories) from decision_trees to something else. Updating the above lines to point at the correct directory solved this issue for me.
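For anyone building from a notebook rather than the %%sh cell, a pre-build check along these lines might catch the same mistake. It is a rough sketch, not part of the example repository: the function name `ensure_executable` is hypothetical, and it sets the executable bits for user, group and others, which is roughly what `chmod +x` does under a typical umask.

```python
import os
import stat


def ensure_executable(directory, names=("train", "serve")):
    """Make the entry-point scripts executable; return the paths that were changed."""
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    fixed = []
    for name in names:
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            # A renamed directory or script shows up here instead of at training time.
            raise OSError("expected script not found: " + path)
        mode = os.stat(path).st_mode
        if mode & exec_bits != exec_bits:
            os.chmod(path, mode | exec_bits)
            fixed.append(path)
    return fixed
```

Calling `ensure_executable("decision_trees")` (or whatever the directory was renamed to) before `docker build` fails fast if the path is wrong, since the permissions are copied into the image at build time.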
Thanks @gregroberts, I will check it. In fact, I didn't run the chmod command, and I will also fix the order of the docker commands.
It's working now. I had to edit the "train" file inside the decision_trees folder. I used the following sequence:
! sudo service docker start
! sudo usermod -a -G docker ec2-user
! docker info
! chmod +x decision_trees/train
! chmod +x decision_trees/serve
! aws ecr create-repository --repository-name decision-trees
! aws ecr get-login --no-include-email
! docker login -u abc -p abc12345 http://abc123
! docker build -t decision-trees .
! docker tag decision-trees aws_account_id.dkr.ecr.us-east-1.amazonaws.com/decision-trees:latest
! docker push aws_account_id.dkr.ecr.us-east-1.amazonaws.com/decision-trees:latest
! aws ecs register-task-definition --cli-input-json file://decision-trees-task-def.json
It seems the issue was resolved.
Please let us know if there is anything we can assist you with.
Yes, @nadiaya thank you very much. I'm closing this issue.
@RubensZimbres What changes did you make to the "train" file inside the decision_trees folder in order to get it to work?
@zmabzug The train file now accesses the S3 bucket directly, instead of the Jupyter notebook doing it.
Hi Rubens,
I am facing exactly the same issue while running the train script for the decision trees example. Can you tell me what changes you made?
I did run chmod on decision_trees/train as mentioned by Greg, but I am still facing the same issue.
Thank You,
Prasad
@prasadpande1990
After running chmod on decision_trees/train as Greg mentioned, build the Docker image again (and push it again, so the repository holds the new image), then try running the training job again.
Hopefully that works.
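The order matters because the file permissions are captured when the image is built, so a chmod after the build has no effect on what gets pushed. As an illustration only, the sequence could be generated like this. The helper `release_commands` is hypothetical, the account ID is a placeholder as in the commands earlier in the thread, and it assumes `docker login` against the registry has already succeeded.

```python
def release_commands(account_id, region="us-east-1", repo="decision-trees",
                     source_dir="decision_trees", tag="latest"):
    """Return the commands, in the order they must run, to rebuild and push the image."""
    uri = "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account_id, region, repo, tag)
    return [
        # 1. Fix permissions first: the build copies them into the image.
        ["chmod", "+x", source_dir + "/train", source_dir + "/serve"],
        # 2. Rebuild so the image picks up the executable scripts.
        ["docker", "build", "-t", repo, "."],
        # 3. Tag and push so the training job pulls the new image.
        ["docker", "tag", repo, uri],
        ["docker", "push", uri],
    ]
```

Each entry can be passed to `subprocess.check_call` in turn, which stops at the first failing step.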