Pipelines: Sagemaker Custom Training Job Error: Unable to locate botocore.credentials

Created on 1 May 2020 · 6 Comments · Source: kubeflow/pipelines

What steps did you take:

I ran a custom image using the SageMaker training operator (https://raw.githubusercontent.com/kubeflow/pipelines/master/components/aws/sagemaker/train/component.yaml) and it ran fine. I am using kfp.aws.use_aws_secret, and the objects from S3 are correctly copied over to the specified local channel path.

The problem arises, however, if I use boto3 inside the custom script to manually download an object from S3: I then get the error "Unable to locate credentials".

What happened:

Below is a copy of the component's logs. Notice that the very first log statement says the boto credentials were found in environment variables, but somehow they never make their way to the boto3 client that is instantiated inside the custom image.

INFO:botocore.credentials:Found credentials in environment variables.
INFO:root:Submitting Training Job to SageMaker...
INFO:root:Created Training Job with name: TrainingJob-20200430232331-LPHY
INFO:root:Training job in SageMaker: 
https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/jobs/TrainingJob-20200430232331-LPHY
INFO:root:CloudWatch logs: 
https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logStream:group=/aws/sagemaker/TrainingJobs;prefix=TrainingJob-20200430232331-LPHY;streamFilter=typeLogStreamPrefix
INFO:root:Job request submitted. Waiting for completion...
INFO:root:Training job is still in status: InProgress
INFO:root:Training job is still in status: InProgress
INFO:root:Training job is still in status: InProgress
INFO:root:Training job is still in status: InProgress
INFO:root:Training job is still in status: InProgress
INFO:root:Training job is still in status: InProgress
INFO:root:Training job is still in status: InProgress
INFO:root:Training failed with the following error: AlgorithmError: Exception during training: Unable to locate credentials
Traceback (most recent call last):
  File "main.py", line 174, in main
    preprocessor_path = get_local_path(params["preprocessor_path"])
  File "main.py", line 86, in get_local_path
    for s3_object in s3_bucket.objects.all():
  File "/opt/conda/lib/python3.7/site-packages/boto3/resources/collection.py", line 83, in __iter__
    for page in self.pages():
  File "/opt/conda/lib/python3.7/site-packages/boto3/resources/collection.py", line 166, in pages
    for page in pages:
  File "/opt/conda/lib/python3.7/site-packages/botocore/paginate.py", line 255, in __iter__
    response = self._make_request(current_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/botocore/paginate.py", line 332, in _make_request
    return self._method(**current_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/botocore/client.py", line 316, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/conda/lib/python3.7/site-packag
Traceback (most recent call last):
  File "train.py", line 81, in <module>
    main()
  File "train.py", line 64, in main
    _utils.wait_for_training_job(client, job_name)
  File "/app/common/_utils.py", line 185, in wait_for_training_job
    raise Exception('Training job failed')
Exception: Training job failed

What did you expect to happen:

I expected the credentials to be passed to the image that the training operator is running, but that is not the case.

Environment:

How did you deploy Kubeflow Pipelines (KFP)?
I deployed Kubeflow Pipelines as part of my Kubeflow deployment on AWS EKS.

KFP version:
Build commit: 743746b

KFP SDK version:
0.5.0

/kind bug

help wanted kind/bug status/triaged

marwan116

Most helpful comment

@surajkota - thank you so much for taking the time to reproduce this. Yes, you are right: it is because I had network_isolation set to True (sorry, I should have taken the time to understand what network_isolation does).

Also thank you for the clarifications!

I saw that @gautamkmr graciously took the time to open an issue concerning the logs - thank you @gautamkmr!

I am closing this now, as this particular issue is resolved.

marwan116 on 12 May 2020

All 6 comments

Thanks Marwan for trying out the component.
I believe your script is baked into your custom image. If so, that custom image runs inside SageMaker, which does not inherit the use_aws_secret values. So you either need to add the permission to the role https://github.com/kubeflow/pipelines/blob/master/components/aws/sagemaker/train/component.yaml#L10 or read the credentials from AWS Secrets Manager.

Would you mind sharing your script or a minimal reproducible example?
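
For the first option (adding permission to the role), a minimal sketch of a read-only S3 policy document is shown below. The function name `build_s3_read_policy` and the bucket name are hypothetical and not taken from this thread; the resulting policy would still have to be attached to the role passed to the component, for example through the IAM console.

```python
import json


def build_s3_read_policy(bucket_name):
    """Return an IAM policy document allowing read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing objects is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                # Downloading objects is granted on the keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


if __name__ == "__main__":
    # "my-example-bucket" is a placeholder, not a value from this thread.
    print(json.dumps(build_s3_read_policy("my-example-bucket"), indent=2))
```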

goswamig on 7 May 2020

@gautamkmr - thank you for taking the time to respond to this issue

Please find below a very simplified version of the script I'd like to run; hopefully it is good enough to show where the issue is. Please note the comments in the script.

import pathlib
import os
import boto3
import sys


def main():
    try:
        # Reading data that sagemaker has copied from s3
        # works fine
        prefix = pathlib.Path('/opt/ml/')
        input_path = prefix / 'input/data/train/'

        with open(input_path / 'test.txt', 'r') as f:
            content = f.read()

        assert 'hello world' in content

        # the below portion is trying to read data from s3
        # using boto3 but it fails
        bucket_name = os.environ['AWS_BUCKET']
        object_name = 'dummy_input/test.txt'
        file_name = 'test.txt'

        s3 = boto3.client('s3')

        # specifically the below line fails:
        # botocore.exceptions.NoCredentialsError: Unable to locate credentials
        s3.download_file(bucket_name, object_name, file_name)

    except Exception:
        sys.exit(255)


if __name__ == '__main__':
    main()

Here is the script for compiling the pipeline, just in case you need it:

import kfp.compiler as compiler
import json
import os
from kfp import components, dsl
from kfp.aws import use_aws_secret


@dsl.pipeline(
    name="sm_kfp_example",
    description="sample sagemaker training job"
)
def sm_kfp_example():
    bucket = os.environ.get('AWS_BUCKET')
    role = os.environ.get('SAGEMAKER_ROLE_ARN')

    train_channels = json.dumps([{
        'ChannelName': 'train',
        'DataSource': {
            'S3DataSource': {
                'S3Uri': f's3://{bucket}/dummy_input/',
                'S3DataType': 'S3Prefix',
                'S3DataDistributionType': 'FullyReplicated'
            }
        },
        'ContentType': '',
        'CompressionType': 'None',
        'RecordWrapperType': 'None',
        'InputMode': 'File'
    }])

    # use the sagemaker training operator defined by aws
    # [a wrapper around Sagemaker CreateTrainingJob]
    repo_path = 'https://raw.githubusercontent.com/kubeflow/pipelines'
    # branch or commit hash of the kfp version to load the component from
    # ('master' here; pin a commit hash for reproducible builds)
    commit = 'master'
    suffix = 'components/aws/sagemaker/train/component.yaml'
    path = f'{repo_path}/{commit}/{suffix}'

    sagemaker_train_op = components.load_component_from_url(path)
    output_path = f's3://{bucket}/output'
    account_id = os.environ.get('AWS_ACCOUNT_ID')
    region = os.environ.get('AWS_REGION')
    image = f'{account_id}.dkr.ecr.{region}.amazonaws.com/sm_kfp_example:latest'

    _ = sagemaker_train_op(
        region=region,
        endpoint_url='',
        image=image,
        training_input_mode='File',
        hyperparameters='{}',
        channels=train_channels,
        instance_type='ml.m5.xlarge',
        instance_count='1',
        volume_size='20',
        max_run_time='3600',
        model_artifact_path=output_path,
        output_encryption_key='',
        network_isolation='True',
        traffic_encryption='False',
        spot_instance='False',
        max_wait_time='3600',
        checkpoint_config='{}',
        role=role,
    ).apply(
        use_aws_secret('aws-secret', 'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY')
    )


def compile(pipeline_func):
    pipeline_filename = pipeline_func.__name__ + ".pipeline.tar.gz"
    compiler.Compiler().compile(pipeline_func, pipeline_filename)


if __name__ == "__main__":
    # compile the pipeline
    compile(sm_kfp_example)

The very strange thing is that if I create the training job using the SageMaker Python SDK (i.e. not using SageMaker's k8s training operator), the script runs fine, i.e. the credentials are passed down to the container. Below is the script in case you need it:

import boto3
import sagemaker
from sagemaker.estimator import Estimator
import os

aws_region = os.environ['AWS_REGION']
algorithm_name = "sm_kfp_example"
s3_bucket = os.environ['AWS_BUCKET']

# use the security token service to verify the account identity
client = boto3.client('sts')
account = client.get_caller_identity()['Account']

# set the sagemaker role
role = os.environ['SAGEMAKER_ROLE_ARN']

# create a boto_session
boto_session = boto3.session.Session(
    region_name=aws_region,
)

# get full training image url
training_image = f"{account}.dkr.ecr.{aws_region}.amazonaws.com/{algorithm_name}:latest"


# specify location on s3_bucket to output the results
s3_output_location = f's3://{s3_bucket}/output'

# create a sagemaker_session
sagemaker_session = sagemaker.session.Session(boto_session=boto_session)

# create an estimator
estimator = Estimator(
    training_image,
    role,
    train_instance_count=1,
    train_instance_type='ml.m5.xlarge',
    train_volume_size=10,  # 10 GB
    train_max_run=600,  # 10 minutes = 600 seconds
    input_mode='File',
    output_path=s3_output_location,
    sagemaker_session=sagemaker_session,
    hyperparameters={},
    base_job_name="sagemaker-sample",
)

estimator.fit(
    inputs={
        'train': f's3://{s3_bucket}/dummy_input/'
    },
    logs=True
)

marwan116 on 8 May 2020

Note: to avoid opening other SageMaker issues, I will list here some of the pain points I have faced trying to integrate SageMaker with Kubeflow. Let me know if there are any solutions to these. Excuse me for not following protocol; if these are deemed valid issues, I would be glad to open the relevant issues:

- I can't seem to pass an image to sagemaker_training_op that is not hosted on ECR, and if the image is hosted on ECR, it has to be in the same region as the one specified for sagemaker_training_op.
- Currently, the SageMaker logs are output to CloudWatch, not to Kubeflow (it would be much easier if they could be forwarded to Kubeflow).

marwan116 on 8 May 2020

There has been an undocumented change to the load_component_* functions. They used to return a ContainerOp; now they return a TaskSpec instead.

Currently there are 2 possible workarounds:

Use the private function _create_container_op_from_component_and_arguments to generate your ContainerOp from the TaskSpec:

from kfp.dsl._component_bridge import _create_container_op_from_component_and_arguments

component_op = components.load_component_from_url(...)
taskspec = component_op(...)
containerop = _create_container_op_from_component_and_arguments(
  taskspec.component_ref.spec, 
  taskspec.arguments, 
  taskspec.component_ref
)

Overwrite the default _default_container_task_constructor:

from kfp.dsl._component_bridge import _create_container_op_from_component_and_arguments
import kfp.components._components as _components

_components._default_container_task_constructor = _create_container_op_from_component_and_arguments

# now load_component will return a containerop
containerop = components.load_component_from_url(...)

PS: @Ark-kun this is not the first time I have seen this issue/question - what do you think?

eterna2 on 12 May 2020

Hi @marwan116, I was able to reproduce the failure. The root cause is that the network_isolation parameter defaults to False in the Python SDK, whereas in the pipeline definition you provided it is set to True, which is also the default value in the training component.

Can you try setting it to False and let us know whether your issue is resolved?

Here are some clarifications based on posts on this thread:

- The logs you initially posted in the issue under the "What happened" section (except the exception) are from the component pod and NOT from the training job itself. As you have already observed, for the training job logs you need to go to CloudWatch.
- The first log line you see, INFO:botocore.credentials:Found credentials in environment variables., is from the boto session that the component backend creates to call the create_training_job API. It uses the credentials from the aws-secret that you created. These credentials are only used to invoke the job and are not passed to the instance in SageMaker.
- The SageMaker instance, which runs in AWS, assumes the credentials from the role ARN you provide in SAGEMAKER_ROLE_ARN, not from the secret.

Let us know if you have more questions.
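
To make the points above concrete, here is a minimal sketch of the kind of request that ends up at SageMaker's CreateTrainingJob API. The helper name `build_training_job_request` and the example values are hypothetical, and this is not the component's actual code; the dictionary keys are CreateTrainingJob request fields. The request carries a role ARN and the EnableNetworkIsolation flag but no access keys, and per the SageMaker documentation a network-isolated container cannot make outbound calls to services such as S3.

```python
def build_training_job_request(job_name, image, role_arn, output_path,
                               network_isolation=False):
    """Build a CreateTrainingJob request dict (sketch, not the component code)."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image,
            "TrainingInputMode": "File",
        },
        # The training instance assumes this role; no access keys are sent.
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_path},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 20,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        # True blocks outbound network calls from the training container.
        "EnableNetworkIsolation": network_isolation,
    }


if __name__ == "__main__":
    # All values below are placeholders, not values from this thread.
    request = build_training_job_request(
        job_name="example-job",
        image="123456789012.dkr.ecr.us-west-2.amazonaws.com/example:latest",
        role_arn="arn:aws:iam::123456789012:role/example-sagemaker-role",
        output_path="s3://example-bucket/output",
    )
    print(request["EnableNetworkIsolation"])
```

Passing the result to boto3.client('sagemaker').create_training_job(**request) would submit the job; that call is left out so the sketch runs without AWS access.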