Sagemaker-python-sdk: exec: "train": executable file not found in $PATH

Created on 11 Jun 2018 · 9 Comments · Source: aws/sagemaker-python-sdk

Hi,

I am developing my own algorithm following the decision trees example at awslabs - scikit_bring_your_own. I copied the Dockerfile, sklearn-task-def.json and the decision_trees folder containing train, serve, predictor.py, wsgi.py and nginx.conf to my Jupyter folder. I set the following permissions: AmazonEC2ContainerServiceFullAccess, AmazonEC2ContainerRegistryFullAccess and AmazonSageMakerFullAccess. Instead of running the code from the command line, I'm using Jupyter.

Then, I run:

! docker build -t decision-trees .
! aws ecr create-repository --repository-name decision-trees
! docker tag decision-trees:latest aws_account_id.dkr.ecr.us-east-1.amazonaws.com/decision-trees
! aws ecr get-login --no-include-email
! docker login -u abc -p abc12345 http://abc123
! docker push aws_account_id.dkr.ecr.us-east-1.amazonaws.com/decision-trees:latest
! aws ecs register-task-definition --cli-input-json file://decision-trees-task-def.json

The repository and the task definition are both created successfully and show as ACTIVE. After that, I set the permissions on the ECR repository to give my own account full access.

In a Python 2 environment, I run the following code:

import pandas as pd
import numpy as np
import sagemaker
from sagemaker.predictor import csv_serializer
import boto3
import re
import os
from sagemaker import get_execution_role

role = get_execution_role()
df=pd.read_csv('s3://sage-maker/DadosTeseLogit.csv',sep=',',header=0)
sel=np.where(df.corr()['selected']>.5)[0][0:-1]
df=df.iloc[:,np.concatenate([[30],sel])]
bucket = 'sage-maker-4'
prefix = 'sagemaker/xgboost-churn'
df.to_csv('train.csv',header=False, index=False)
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
containers = {'us-east-1': '1234567.dkr.ecr.us-east-1.amazonaws.com/decision-trees:latest'}
sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.fit({'train': s3_input_train})

Initially I got a regex validation error, because the underscore in '1234567.dkr.ecr.us-east-1.amazonaws.com/decision_trees:latest' is not allowed, so I changed the name to 'decision-trees'. But now I'm getting the following error:

..............

exec: "train": executable file not found in $PATH

ValueError Traceback (most recent call last)
in ()
----> 1 xgb.fit({'train': s3_input_train})

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name)
177 self.latest_training_job = _TrainingJob.start_new(self, inputs)
178 if wait:
--> 179 self.latest_training_job.wait(logs=logs)
180
181 @classmethod

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
366 def wait(self, logs=True):
367 if logs:
--> 368 self.sagemaker_session.logs_for_job(self.job_name, wait=True)
369 else:
370 self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll)
784
785 if wait:
--> 786 self._check_job_status(job_name, description, 'TrainingJobStatus')
787 if dot:
788 print()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
524 if status != 'Completed' and status != 'Stopped':
525 reason = desc.get('FailureReason', '(No reason provided)')
--> 526 raise ValueError('Error training {}: {} Reason: {}'.format(job, status, reason))
527
528 def wait_for_endpoint(self, endpoint, poll=5):

ValueError: Error training decision-trees-2018-06-10-23-27-03-210: Failed Reason: AlgorithmError: Exit Code: 1

This is odd, because the Dockerfile explicitly adds /opt/program to the PATH and copies train there:

ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

COPY decision_trees /opt/program
COPY decision_trees/train /opt/program
COPY decision_trees/serve /opt/program
COPY decision_trees/predictor.py /opt/program
COPY decision_trees/nginx.conf /opt/program
COPY decision_trees/wsgi.py /opt/program
WORKDIR /opt/program

Any ideas on how to overcome this issue?

Thanks in advance.

RubensZimbres

👍1

Most helpful comment

I ran into the same problem when adapting from the scikit_bring_your_own example.

Found this issue because I googled that error message!

In my case, this was because I hadn't updated the build script to make the new train & serve scripts executable.

That is, in the notebook describing the process there is a large %%sh cell where the image is built (among other things). In that cell, the lines:

chmod +x decision_trees/train
chmod +x decision_trees/serve

make sure that docker can actually execute these scripts.

In my case, I had renamed everything (including those directories) from decision_trees to something else. Updating the chmod lines to point at the correct directory solved the issue for me.
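To illustrate why the missing chmod produces this particular error, here is a minimal Python sketch. It assumes the container runtime resolves train the same way a shell PATH lookup does, which is what shutil.which emulates: a file that exists in a PATH directory but has no execute bit is treated as not found. The helper names find_entrypoint and make_executable are hypothetical and are not part of the SageMaker SDK.

```python
import os
import shutil
import stat
import tempfile


def find_entrypoint(name, program_dir):
    """Return the path a PATH-style lookup would resolve, or None."""
    # shutil.which only returns files that exist AND are executable,
    # which is the same condition behind
    # 'exec: "train": executable file not found in $PATH'.
    return shutil.which(name, path=program_dir)


def make_executable(path):
    """Rough Python equivalent of `chmod +x path`."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


def demo():
    # A temporary directory stands in for /opt/program (an assumption
    # made only so the sketch can run anywhere).
    with tempfile.TemporaryDirectory() as program_dir:
        train = os.path.join(program_dir, "train")
        with open(train, "w") as f:
            f.write("#!/usr/bin/env python\nprint('training')\n")
        os.chmod(train, 0o644)  # readable, but no execute bit

        before = find_entrypoint("train", program_dir)
        make_executable(train)
        after = find_entrypoint("train", program_dir)
        return before, after is not None


if __name__ == "__main__":
    print(demo())
```

The lookup fails before chmod even though the file is sitting in the directory, and succeeds afterwards. Since COPY in a Dockerfile carries the file mode into the image, the bit has to be set before the image is built.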

gregroberts on 13 Jun 2018

👍7

All 9 comments

Thanks @gregroberts I will check it. In fact I didn't run the chmod command and I will fix the docker commands order.

RubensZimbres on 14 Jun 2018

It's working now. I had to edit the "train" file inside decision_trees folder. I used the following sequence:

! sudo service docker start
! sudo usermod -a -G docker ec2-user
! docker info
! chmod +x decision_trees/train
! chmod +x decision_trees/serve
! aws ecr create-repository --repository-name decision-trees
! aws ecr get-login --no-include-email
! docker login -u abc -p abc12345 http://abc123
! docker build -t decision-trees .
! docker tag decision-trees aws_account_id.dkr.ecr.us-east-1.amazonaws.com/decision-trees:latest
! docker push aws_account_id.dkr.ecr.us-east-1.amazonaws.com/decision-trees:latest
! aws ecs register-task-definition --cli-input-json file://decision-trees-task-def.json

RubensZimbres on 17 Jun 2018

👍1

It seems the issue was resolved.
Please, let us know if there is anything we can assist you with.

nadiaya on 19 Jun 2018

Yes, @nadiaya thank you very much. I'm closing this issue.

RubensZimbres on 19 Jun 2018

@RubensZimbres What changes did you make to the "train" file inside the decision_trees folder in order to get it to work?

zmabzug on 2 Jul 2018

@zmabzug In fact the train file is directly accessing the S3 bucket, instead of the Jupyter Notebook doing it.

RubensZimbres on 2 Jul 2018

Hi Rubens,

I am facing exactly the same issue while running the train script for the decision trees example. Can you tell me what changes you made?

I did run chmod on decision_trees/train as mentioned by Greg, but I am still facing the same issue.

Thank You,
Prasad

prasadpande1990 on 18 May 2019

@prasadpande1990
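After running chmod on decision_trees/train as mentioned by Greg, rebuild the Docker image, push it again, and then rerun the training job. The image keeps the file permissions from the time it was built, so a chmod on its own does not change an image that already exists.

Hopefully that works.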
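One way to catch this before rebuilding is a small preflight check on the build folder. The sketch below is only an illustration: the function name preflight and the default entrypoint names are assumptions based on the example layout in this thread, and a temporary folder stands in for decision_trees.

```python
import os
import tempfile


def preflight(folder, names=("train", "serve")):
    """Report entrypoint scripts that are missing or lack the execute bit."""
    problems = {}
    for name in names:
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            problems[name] = "missing"
        elif not os.access(path, os.X_OK):
            problems[name] = "not executable"
    return problems


def demo():
    # Build a throwaway folder where train is executable and serve is not.
    with tempfile.TemporaryDirectory() as folder:
        for name, mode in (("train", 0o755), ("serve", 0o644)):
            path = os.path.join(folder, name)
            with open(path, "w") as f:
                f.write("#!/usr/bin/env python\n")
            os.chmod(path, mode)
        return preflight(folder)


if __name__ == "__main__":
    print(demo())
```

An empty result means both scripts are present and executable. Anything else should be fixed with chmod +x before running docker build again.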