Aws-cdk: Batch and UserData synths ok but fails deploy with: "Operation failed, ComputeEnvironment went INVALID with error: CLIENT_ERROR - Launch Template UserData is not MIME multipart format"

Created on 25 Feb 2020 · 14 Comments · Source: aws/aws-cdk

The context for this issue is well described in the Gitter aws-cdk channel over here, along with the tests performed by @skinny85, @reisingerf and myself:

https://gitter.im/awslabs/aws-cdk?at=5e54579d9aeef6523217b25f

Reproduction Steps

Given the following snippet, as described in the Gitter thread:

const vpc = ec2.Vpc.fromLookup(this, 'Vpc', {
            isDefault: true,
        });

        const batch_instance_role = new iam.Role(this, 'BatchInstanceRole', {
            roleName: 'UmccriseBatchInstanceRole',
            assumedBy: new iam.CompositePrincipal(
                new iam.ServicePrincipal('ec2.amazonaws.com'),
                new iam.ServicePrincipal('ecs.amazonaws.com'),
            ),
            managedPolicies: [
                iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonEC2RoleforSSM'),
                iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonEC2ContainerServiceforEC2Role')
            ],
        });
        const spotfleet_role = new iam.Role(this, 'AmazonEC2SpotFleetRole', {
            assumedBy: new iam.ServicePrincipal('spotfleet.amazonaws.com'),
            managedPolicies: [
                iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonEC2SpotFleetTaggingRole'),
            ],
        });
        const batch_service_role = new iam.Role(this, 'BatchServiceRole', {
            assumedBy: new iam.ServicePrincipal('batch.amazonaws.com'),
            managedPolicies: [
                iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSBatchServiceRole'),
            ],
        });
        const batch_instance_profile = new iam.CfnInstanceProfile(this, 'BatchInstanceProfile', {
            instanceProfileName: 'UmccriseBatchInstanceProfile',
            roles: [batch_instance_role.roleName],
        });

        const launch_template = new ec2.CfnLaunchTemplate(this, 'LaunchTemplate', {
            launchTemplateData: {
                userData: core.Fn.base64(`
                    MIME-Version: 1.0
                    Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

                    --==MYBOUNDARY==
                    Content-Type: text/x-shellscript; charset="us-ascii"

                    #!/bin/bash          
                    echo Hello

                    --==MYBOUNDARY==--
                `),
            },
            launchTemplateName: 'UmccriseBatchComputeLaunchTemplate',
        });
        new batch.CfnComputeEnvironment(this, 'UmccriseBatchComputeEnv', {
            type: 'MANAGED',
            serviceRole: batch_service_role.roleArn,
            computeResources: {
                type: 'SPOT',
                maxvCpus: 128,
                minvCpus: 0,
                desiredvCpus: 0,
                imageId: 'ami-05c621ca32de56e7a',
                launchTemplate: {
                    launchTemplateId: launch_template.ref,
                    version: launch_template.attrLatestVersionNumber,
                },
                spotIamFleetRole: spotfleet_role.roleArn,
                instanceRole: batch_instance_profile.instanceProfileName!,
                instanceTypes: ['optimal'],
                subnets: [vpc.publicSubnets[0].subnetId],
                securityGroupIds: ['sg-0a5cf974'],
                tags: { 'Creator': 'Batch' },
            }
        });

For more context, there's this other working example too:

https://github.com/awslabs/aws-batch-helpers/issues/5#issue-425133706

Error Log

This:

Operation failed, ComputeEnvironment went INVALID with error: CLIENT_ERROR - Launch Template UserData is not MIME multipart format

Coupled with the deploy time error (in Python, ask @skinny85 for the TypeScript counterpart):

 6/10 | 10:01:20 AM | UPDATE_FAILED        | AWS::Batch::ComputeEnvironment        | UmccriseBatchComputeEnv Operation failed, ComputeEnvironment went INVALID with error: CLIENT_ERROR - Launch Template UserData is not MIME multipart format
        /Users/romanvg/.miniconda3/envs/cdk/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7838:49
        \_ Kernel._wrapSandboxCode (/Users/romanvg/.miniconda3/envs/cdk/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:8298:20)
        \_ Kernel._create (/Users/romanvg/.miniconda3/envs/cdk/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7838:26)
        \_ Kernel.create (/Users/romanvg/.miniconda3/envs/cdk/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7585:21)
        \_ KernelHost.processRequest (/Users/romanvg/.miniconda3/envs/cdk/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7372:28)
        \_ KernelHost.run (/Users/romanvg/.miniconda3/envs/cdk/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7312:14)
        \_ Immediate._onImmediate (/Users/romanvg/.miniconda3/envs/cdk/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7315:37)
        \_ processImmediate (internal/timers.js:456:21)

Environment

CLI Version : 1.25.0 (build 5ced526)
Framework Version: 1.25.0 (build 5ced526)
OS : macOS Catalina 10.15.3
Language : Python and TypeScript

This is :bug: Bug Report

@aws-cdk/aws-batch bug response-requested

Source

brainstorm

All 14 comments

Here's the original CDK-Python snippet that fails in the same way and triggered the Gitter discussion:

https://github.com/umccr/infrastructure/blob/495404bbfb0b3b6bf50d9640b2bc012f851c3600/cdk/apps/umccrise/stacks/batch.py

brainstorm on 25 Feb 2020

There seems to be something funny going on with new line handling.
It was brought to my attention, by an AWS support engineer, that there shouldn't be any empty lines in user data scripts. However, whenever I try to create a multi-line string in Python it seems to add empty new lines when writing the CF template.
I am not sure that is the real issue, but it is unexpected and looks wrong.

For example:

user_data_script = """
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"

--//
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash
/bin/echo "Hello World" >> /tmp/testfile.txt
--//
"""

with 'userData': core.Fn.base64(user_data_script), would end up being:

      LaunchTemplateData:
        UserData:
          Fn::Base64: >

            MIME-Version: 1.0

            Content-Type: multipart/mixed; boundary="//"


            --//

            Content-Type: text/x-shellscript; charset="us-ascii"

            #!/bin/bash

            /bin/echo "Hello World" >> /tmp/testfile.txt

            --//
      LaunchTemplateName: UmccriseBatchComputeLaunchTemplate

A 'userData': core.Fn.base64(core.Fn.join(delimiter='\n', list_of_values=['#!/bin/bash', 'echo Hello'])) has the same issue of adding empty lines.

Also, tests without core.Fn.base64 seem to have that issue:

'userData': '#!/bin/bash\necho FOO'
produces:

      LaunchTemplateData:
        UserData: >-
          #!/bin/bash

          echo FOO
      LaunchTemplateName: UmccriseBatchComputeLaunchTemplate

Unexpected new line

'userData': '#!/bin/bash\recho FOO'
An attempt to overwrite the additional new line, but it produces:

      LaunchTemplateData:
        UserData: "#!/bin/bash\recho FOO"
      LaunchTemplateName: UmccriseBatchComputeLaunchTemplate

Carriage return not recognised?

reisingerf on 27 Feb 2020

@brainstorm I believe the problem is that when you declare a multi-line string in Python like so:

user_data = `
my-script
`

What you actually get is '\nmy-script\n'. This causes the first line in the script to be empty, which violates the User Data Formats.

Try replacing your declaration with:

user_data = `MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

--==MYBOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash          
echo Hello

--==MYBOUNDARY==--`
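
As a minimal sketch of the leading-newline point for Python, assuming a triple-quoted string that is opened on its own line and indented along with the surrounding code: the literal then begins with '\n' and every line carries the source indentation. The standard library can strip both. The variable names here are illustrative and not taken from the original stack.

```python
import textwrap

# Opening the literal on its own line means the string starts with "\n",
# and each line keeps the indentation used in the source file.
raw = """
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

    --==MYBOUNDARY==
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    echo Hello

    --==MYBOUNDARY==--
"""

# dedent() removes the common leading whitespace; strip("\n") drops the
# empty first line and the trailing newline.
user_data = textwrap.dedent(raw).strip("\n")

print(repr(raw[:5]))
print(repr(user_data[:17]))
```

The blank lines inside the MIME body (between headers and content) are preserved; only the outer newlines and the shared indentation are removed.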

iliapolo on 5 Mar 2020

Thanks for the suggestion!
However, I'm afraid I don't understand. If I run your user_data code I get:

Traceback (most recent call last):
  File "app.py", line 4, in <module>
    from stacks.batch import BatchStack
  File "/Users/freisinger/Devel/projects/github/UMCCR/infrastructure/cdk/apps/umccrise/stacks/batch.py", line 22
    user_data = `MIME-Version: 1.0
                ^
SyntaxError: invalid syntax
Subprocess exited with error 1

I did use triple quotes for my multiline user data string, but I still end up with extra empty lines.

We also use an array of strings with core.Fn.join, with the same result of added empty lines.
See GitHub link above for a flavour of our attempts...

reisingerf on 6 Mar 2020

After much playing around, I've managed to find a workaround for a Python deployment.
Reading in the user data from a file is better than using a multi-line string.

Part 1: Read in the user data

with open("user_data/user_data.txt", "r") as user_data_h:
    user_data = user_data_h.read()

Part 2: Assign as a UserData object with the custom method

user_init = ec2.UserData.custom(user_data)

Part 3: Add to launch_template_data dict

Calling render() gets rid of the lines attribute in the synthesized stack, and Fn.base64 re-encodes the result as appropriate.

launch_template_data = {
     "UserData": core.Fn.base64(user_init.render())
}

Part 4: Initialise the launch template

launch_template = ec2.CfnLaunchTemplate(self, "LaunchTemplate", launch_template_name="UmccriseBatchComputeLaunchTemplateDev")

Part 5: Override the launch template property

Passing the user data in the previous step via the launch_template_data kwarg doesn't seem to work, so we override the property using add_property_override instead.

launch_template.add_property_override("LaunchTemplateData", launch_template_data)

Part 6: Validate

Our launch template should look like this after running cdk synth

LaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        UserData:
          Fn::Base64: >-
            MIME-Version: 1.0

            Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="


            --==MYBOUNDARY==

            Content-Type: text/x-shellscript; charset="us-ascii"


            #!/bin/bash

            echo Hello


            --==MYBOUNDARY==--
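
An alternative to hand-writing the MIME envelope is to let Python's standard email package generate it, which sidesteps stray leading newlines and indentation altogether. This is only a sketch under assumptions: build_user_data is a hypothetical helper, and it has not been verified here that AWS Batch accepts the exact header order and extra part headers the email package emits.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_user_data(script: str, boundary: str = "==MYBOUNDARY==") -> str:
    """Wrap a shell script in a multipart/mixed envelope (hypothetical helper)."""
    envelope = MIMEMultipart("mixed", boundary=boundary)
    # text/x-shellscript with us-ascii matches the part type used in this thread.
    envelope.attach(MIMEText(script, "x-shellscript", "us-ascii"))
    return envelope.as_string()


user_data = build_user_data("#!/bin/bash\necho Hello\n")

# Sanity check: parse the result back and confirm it is really multipart.
parsed = message_from_string(user_data)
print(parsed.is_multipart(), parsed.get_content_type())
```

The resulting string could then be passed to core.Fn.base64(...) in place of the hand-written literal.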

alexiswl on 6 Mar 2020

@reisingerf Wrote:

Thanks for the suggestion!
However, I'm afraid I don't understand. If I run your user_data code I get:

Traceback (most recent call last):
  File "app.py", line 4, in <module>
    from stacks.batch import BatchStack
  File "/Users/freisinger/Devel/projects/github/UMCCR/infrastructure/cdk/apps/umccrise/stacks/batch.py", line 22
    user_data = `MIME-Version: 1.0
                ^
SyntaxError: invalid syntax
Subprocess exited with error 1

Sorry, I got mixed up with typescript/python multiline declarations :)

@reisingerf Wrote:

I did use triple quotes for my multiline user data string, but I still end up with extra empty lines.

We also use an array of strings with core.Fn.join, with the same result of added empty lines.
See GitHub link above for a flavour of our attempts...

I'm still pretty convinced it's the multiline declaration problem. Here is a working (validated) snippet, based on the code @brainstorm posted.

    const vpc = ec2.Vpc.fromLookup(this, 'Vpc', {
      isDefault: true,
    });

    const batch_service_role = new iam.Role(this, 'BatchServiceRole', {
      assumedBy: new iam.ServicePrincipal('batch.amazonaws.com'),
      managedPolicies: [
          iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSBatchServiceRole'),
      ],
    });

    const launch_template = new ec2.CfnLaunchTemplate(this, 'LaunchTemplate', {
      launchTemplateData: {
          userData: cdk.Fn.base64(`MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

--==MYBOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
echo Hello

--==MYBOUNDARY==--`),
      },
      launchTemplateName: 'UmccriseBatchComputeLaunchTemplate',
    });

    const batch_instance_role = new iam.Role(this, 'BatchInstanceRole', {
      roleName: 'UmccriseBatchInstanceRole',
      assumedBy: new iam.CompositePrincipal(
          new iam.ServicePrincipal('ec2.amazonaws.com'),
          new iam.ServicePrincipal('ecs.amazonaws.com'),
      ),
      managedPolicies: [
          iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonEC2RoleforSSM'),
          iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonEC2ContainerServiceforEC2Role')
      ],
    });

    const batch_instance_profile = new iam.CfnInstanceProfile(this, 'BatchInstanceProfile', {
      instanceProfileName: 'UmccriseBatchInstanceProfile',
      roles: [batch_instance_role.roleName],
    });

    const spotfleet_role = new iam.Role(this, 'AmazonEC2SpotFleetRole', {
      assumedBy: new iam.ServicePrincipal('spotfleet.amazonaws.com'),
      managedPolicies: [
          iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonEC2SpotFleetTaggingRole'),
      ],
    });

    new batch.CfnComputeEnvironment(this, "Env", {
      type: "MANAGED",
      serviceRole: batch_service_role.roleArn,
      computeResources: {
        type: 'SPOT',
        maxvCpus: 128,
        minvCpus: 0,
        launchTemplate: {
            launchTemplateId: launch_template.ref,
            version: launch_template.attrLatestVersionNumber,
        },
        instanceRole: batch_instance_profile.instanceProfileName!,
        instanceTypes: ['optimal'],
        subnets: [vpc.publicSubnets[0].subnetId],
        spotIamFleetRole: spotfleet_role.roleArn,
      }
    })
  }

The missing part from my earlier suggestion is the indentation, notice:

const userData = `MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

--==MYBOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
echo Hello

--==MYBOUNDARY==--`

The Python counterpart of using triple quotes should work the same.

Regarding your previous attempts you mentioned, here is my theory on why they didn't work:

https://github.com/umccr/infrastructure/blob/495404bbfb0b3b6bf50d9640b2bc012f851c3600/cdk/apps/umccrise/stacks/batch.py#L21 contains a \n in the beginning.
https://github.com/umccr/infrastructure/blob/495404bbfb0b3b6bf50d9640b2bc012f851c3600/cdk/apps/umccrise/stacks/batch.py#L21 is actually missing some blank lines in the middle (not 100% sure where exactly)
https://github.com/umccr/infrastructure/blob/495404bbfb0b3b6bf50d9640b2bc012f851c3600/cdk/apps/umccrise/stacks/batch.py#L37 same
https://github.com/umccr/infrastructure/blob/495404bbfb0b3b6bf50d9640b2bc012f851c3600/cdk/apps/umccrise/stacks/batch.py#L10 - This should actually work! I don't see exactly how you used it in the commented code... Was it just core.Fn.base64(user_data_script)?

iliapolo on 6 Mar 2020

@iliapolo thanks for looking into this!

I started from scratch and, using your example, I finally got the user data deployed. I don't know why the previous attempts were unsuccessful (partially due to the empty first line for sure, but we tried so many things that this couldn't have been the only reason).
I also noticed (after quite some frustration) that updates to my user data seemed to be deployed but did not actually change the user data run by the starting instances. Looking at the AWS console, I noticed that each change to the LaunchTemplate created a new version, but the instances always used the default version, which was the first and oldest one.

Any idea how I would go about using the latest version by default?

reisingerf on 9 Mar 2020

Actually it seems there are two LaunchTemplates created. The one I defined in my CDK stack and another one that by the looks of it seems to be a combination of mine and some other one (setting some ECS variables).

I am not sure which one is used when the batch instances are booted, but it does not pick up any changes to my template.

reisingerf on 10 Mar 2020

@reisingerf Notice that the example above uses:

launchTemplate: {
    launchTemplateId: launch_template.ref,
    version: launch_template.attrLatestVersionNumber,
},

Which uses the current latest version (granted the name is a bit confusing).
So when you first deployed, it set the launch template version of the compute environment to a fixed number.

Since compute environments don't support modifying this, I imagine CloudFormation doesn't send the update request (though it would have been nice to get an error in this case). Therefore, your compute environment will always use the initial version.

To fix, you can use:

launchTemplate: {
    launchTemplateId: launch_template.ref,
    version: "$Latest", // Notice that $Default is also supported to use the default launch template version.
},

From the AWS console:

[Screenshot, 2020-03-10]

Regarding the two templates, yes, this is the expected behavior.

One template is the one you created and are maintaining. The other one is created by Batch and indeed adds some variables; this is, btw, the reason your user data has to be a MIME multi-part archive. Eventually, the launch template that is used is the one Batch created, but it should always contain your configuration as well, so you don't have to worry about it.

Let me know if this resolved the issue.

iliapolo on 10 Mar 2020

@iliapolo thanks a lot for the explanations! Much appreciated!

I figured that the Batch internal use of UserData would enforce the _MIME multi-part archive_ UserData format. I guess that's what caused me some headache in the beginning when I was trying to get a simple Hello world bash user data script to work.

As for the LaunchTemplate version. I found the relevant paragraph in the docs:

AWS Batch does not support updating a compute environment with a new launch template version. If you update your launch template, you must create a new compute environment with the new template for the changes to take effect

Also, you said the second LaunchTemplate is created by Batch (I assume at ComputeEnvironment creation time) and incorporates my own LaunchTemplate. I guess that's the reason why Batch does not support template updates.

If so, then I don't quite understand your suggested fix for it though. If I specify a version number (or $Latest), I would still have to recreate the ComputeEnv, as it would incorporate that version into its own LaunchTemplate on creation time (as static copy), right?

Or are you saying that if I specify $Latest as version in the CfnComputeEnvironment and that version changes, CDK will detect the change and automatically recreate the ComputeEnvironment?
And how would $Latest then differ from launch_template.attrLatestVersionNumber?

reisingerf on 10 Mar 2020

@reisingerf You are right.

I mistakenly assumed that $Latest is used as a pointer, and batch does whatever changes needed to its own launch template.

I now understand this is not the case and actually $Latest indeed does not differ from launch_template.attrLatestVersionNumber.

I even tried updating the managed template myself, in the hopes that $Latest perhaps refers to its own managed latest, but that didn't work either.

You mentioned that:

I also noticed (after quite some frustration) that my updates to my user data, seemed to be deployed, but actually did not change the user data run by the starting instances.

If you update the launch template from the CDK app, it should have also caused a re-creation of the compute environment because launch_template.attrLatestVersionNumber now evaluates to a different value, and according to this, CloudFormation would replace the compute environment and the changes should apply.

Can you double check the compute environment was indeed replaced? Note that if the environment is used by some queue (which it usually is), the replacement will fail and actually result in two environments, one pointing to the old template version, and one to the new, with the queue still connected to the old one.

It looks like the experience of updating a launch template isn't tight enough and has a few problems, both from the CloudFormation and the CDK side. I'll try to think about how we can improve on that (a feature request from you would be appreciated as well :)).

In the meantime, the safest and most streamlined approach (albeit somewhat slow) would be to create a batch.CfnJobQueue in the CDK app and run cdk destroy && cdk deploy each time you change the launch template.

iliapolo on 10 Mar 2020

@iliapolo thanks again and understood.

My experiences match what you say. I had the issue with ending up with two compute envs and resorted to exactly your suggestion of destroy and deploy.

Any preferred way of creating that feature request? Shall I open a new ticket on this repo or try to request changes to CloudFormation via a support request?

reisingerf on 10 Mar 2020

@reisingerf good question. Since the root cause is actually batch not supporting launch template version updates, I think the best approach would be to submit a feature request in the AWS Batch Developer Forum.

In addition, you can create a general issue in this repo where we can discuss possible approaches for the CDK to help mitigate that.

Once you do that, can you please link those issues from here and close this one?

Thanks!

iliapolo on 11 Mar 2020

Done.

Forum link: https://forums.aws.amazon.com/thread.jspa?threadID=318580

GH issue: https://github.com/aws/aws-cdk/issues/6686

reisingerf on 12 Mar 2020
