Terraform: provider/aws: Lambda function w/ vpc_config - destruction may fail

Created on 21 Mar 2016  ·  23 comments  ·  Source: hashicorp/terraform

Probably not a specific issue with Terraform, but wanted to open up a discussion and maybe add extra support for a feature like https://github.com/hashicorp/terraform/issues/4121

An issue exists where the destruction of a lambda function with a vpc_config block has the potential of leaving behind infrastructure that prevents the completion of a terraform destroy action. This is due to the fact that the aws lambda function creates an ENI inside the VPC passed to the config block and is outside the scope of Terraform state.

As a specific example, if a user creates the following resources in a single template:

  • aws s3 bucket
  • aws security group
  • aws lambda function

    • embedded local-exec provisioner configuring the former bucket as an event source for itself, with ['s3:ObjectCreated:*', 's3:ObjectRemoved:*'] as event types.

    • vpc_config that contains subnets of a particular VPC and the security group ID of the preceding resource

  • aws s3 object

When this template is applied, all the resources are created as expected. In addition, though, an ENI is created by the AWS Lambda service, which allows the Lambda function to communicate with internal VPC resources.

When this template is destroyed, the S3 object is the first resource to be destroyed, thus invoking the Lambda function. AWS will still allow the destruction of the Lambda function, but it does not clean up the ENI, presumably because an instance of the function is executing in parallel. The ENI has the Terraform-defined security group attached, and the result is something that looks like:

[automation] aws_security_group.eod_lambda: Destroying...
[automation] Error applying plan:
[automation]
[automation] 1 error(s) occurred:
[automation]
[automation] * aws_security_group.eod_lambda: DependencyViolation: resource sg-afceafc8 has a dependent object
[automation]    status code: 400, request id:
[automation]
[automation] Terraform does not automatically rollback in the face of errors.
[automation] Instead, your Terraform state file has been partially updated with
[automation] any resources that successfully completed. Please address the error
[automation] above and apply again to incrementally change your infrastructure.
Aborting due to errors in this stage

My idea for a workaround was to create a null_resource in the template that would run a local-exec script checking for leftover ENIs after the fact, but that isn't supported because provisioners only execute on resource creation.
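For reference, the kind of cleanup such a provisioner would have run can be sketched in boto3. This is a hypothetical script, not a supported workaround: the security-group filter and the description-prefix check are assumptions based on the Lambda behaviour observed in this thread.

```python
# Hypothetical cleanup script; all names and filters are illustrative.
LAMBDA_ENI_PREFIX = "AWS Lambda VPC ENI"

def looks_like_lambda_eni(description):
    # Observed naming convention for Lambda-created ENIs, not a documented contract.
    return (description or "").startswith(LAMBDA_ENI_PREFIX)

def cleanup_leftover_enis(sg_id, region="us-east-1"):
    import boto3  # imported lazily; assumed available where the script runs
    ec2 = boto3.client("ec2", region_name=region)
    enis = ec2.describe_network_interfaces(
        Filters=[{"Name": "group-id", "Values": [sg_id]}]
    )["NetworkInterfaces"]
    for eni in enis:
        if not looks_like_lambda_eni(eni.get("Description")):
            continue  # never touch ENIs we don't recognise
        att = eni.get("Attachment")
        if att:
            ec2.detach_network_interface(AttachmentId=att["AttachmentId"], Force=True)
        ec2.delete_network_interface(NetworkInterfaceId=eni["NetworkInterfaceId"])
```

Even if it could run at destroy time, a script like this would still race against Lambda's own ENI lifecycle, so treat it as illustration only.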

bug provider/aws

Most helpful comment

My 2 cents here: I had the same issue. I am using Lambda inside a VPC with an SG (basically my Lambda function needs to make some calls to an internal ELB), so what I have done - and so far it has worked fine - is to update my Lambda IAM inline policy to:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateNetworkInterface",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DetachNetworkInterface",
                "ec2:ModifyNetworkInterfaceAttribute",
                "ec2:DeleteNetworkInterface"
            ],
            "Resource": "*"
        }
    ]
}

I have destroyed my test environment 3 times now and the problem did not arise again. Can someone test it and let me know if it checks out for you?

All 23 comments

I want to add additional context around this problem. This isn't an issue only in the use case outlined above, but any time a Lambda resource with a vpc_config is created using Terraform. I have been in contact with AWS about this behavior, and it is an intended design of their service to allow ENI re-use to make the service more efficient.

Per AWS:

Due to the nature of Lambda, container reuse is to be expected from the service [1]
This includes any ENIs which are attached to the function. Reuse is emphasized in Lambda in order to minimize execution time of your functions

Knowing this makes me believe that running CRUD operations with a template that includes an aws_lambda_function resource is not reliable at this time.

I can confirm this (confusing) behaviour. The Lambda function delete API call should IMO also clean up any ENIs it has created, which doesn't seem to be happening.

This wasn't caught by the nightly Terraform acceptance test suite because the ENI is created when the Lambda function is invoked for the 1st time, not during its creation. We are not invoking Lambda functions as part of those tests. If we did, these would have failed due to inability to delete security group.

I have been in contact with AWS about this behavior and it is an intended design of their service to allow network ENI re-use to make the service more efficient.

I would understand the need for ENI reuse by the same Lambda function as it "scales" up/down, but not across different Lambda functions.

Sanity check - IAM permissions

Documentation

I was re-reading through related documentation, e.g. http://docs.aws.amazon.com/lambda/latest/dg/vpc.html
which recommends the ec2:CreateNetworkInterface, ec2:DescribeNetworkInterfaces and ec2:DeleteNetworkInterface permissions in the IAM Role, but Lambda still doesn't delete the ENI(s) on lambda:DeleteFunction.

Stack Overflow

Then I also found this SO thread, which widens the permissions.
Some of the wider permissions would make sense - e.g. it may need ec2:DetachNetworkInterface too. However, even after giving the Lambda IAM Role permissions according to that SO thread, it still didn't delete its ENI(s) on lambda:DeleteFunction.

Serverless GH issue

I also read this thread https://github.com/serverless/serverless/issues/629 which recommends using the AWS' managed policy AWSLambdaVPCAccessExecutionRole. That still didn't make lambda:DeleteFunction delete the associated ENIs.


Here's the full repro case I will share with AWS in the form of AWS-CLI based shell script:
https://gist.github.com/radeksimko/0cc2152e72efa4d1ae4d689ae29fbce8

Possible solutions/workarounds

Here's an example of ENI created by Lambda Function on Invoke:
https://gist.github.com/radeksimko/e8d9176f12e20f521322103f607b6c24

We might be able to make Terraform clean up these ENIs as long as we're able to reliably identify the right ones - i.e. we must avoid deleting any other ENIs.

We would need to be able to identify ENIs that have been created by a specific Lambda Function ARN/name+region.

Just hit this same issue. Has anyone found a workaround that allows terraform destroy to work?

@brikis98 I'm afraid the only workaround for now is to go and identify the ENIs manually and detach & delete those.

I'm still waiting for response from AWS, hopefully they will shed some light on this.

@brikis98

My solution to this is a bit unorthodox, but I have a destruction method in my Lambda function that deletes its own ENI. Once an ENI is attached to your Lambda function, it becomes the default interface for all external comms, so that method needs to be the last function called.

Essentially, I look for a particular key in the event source to determine whether the invocation is related to the destruction of the Lambda function. If so, I proceed to look up all ENIs with the security group ID attached, iterate over that result set, and issue detach and destroy calls with boto3. It has been reliable for me since. I also want to mention I have a wrapper around Terraform that has its own retry and wait functionality, which helps with the timing of the destruction of resources as well.
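A minimal sketch of that self-destruct pattern, assuming a sentinel key in the event payload and boto3 in the Lambda runtime. All names here ("action", "self-destruct", "security_group_id") are hypothetical, not @rbachman's actual code:

```python
def do_normal_work(event):
    # Placeholder for the function's real workload.
    return {"status": "ok"}

def is_destroy_event(event):
    # Hypothetical sentinel: the teardown wrapper invokes the function
    # with {"action": "self-destruct"} before terraform destroy runs.
    return event.get("action") == "self-destruct"

def handler(event, context):
    if not is_destroy_event(event):
        return do_normal_work(event)
    # Must be the last thing the function does: once its ENI is gone,
    # the function loses all VPC connectivity.
    import boto3  # assumed available in the Lambda runtime
    ec2 = boto3.client("ec2")
    sg_id = event["security_group_id"]  # hypothetical: passed in by the wrapper
    enis = ec2.describe_network_interfaces(
        Filters=[{"Name": "group-id", "Values": [sg_id]}]
    )["NetworkInterfaces"]
    for eni in enis:
        att = eni.get("Attachment")
        if att:
            ec2.detach_network_interface(AttachmentId=att["AttachmentId"], Force=True)
        ec2.delete_network_interface(NetworkInterfaceId=eni["NetworkInterfaceId"])
```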

@rbachman Would you mind sharing any of the details about how you are identifying the right ENIs?

i.e. which fields from https://gist.github.com/radeksimko/e8d9176f12e20f521322103f607b6c24 are the right ones for reliable identification?

@rbachman Thanks for the info. If you're able to share any of the code, that would be really helpful!

Hitting this as well. My use case is writing an end-to-end test for a Lambda module: create something that will stand up a VPC, create the function, verify that it passes the test, and then destroy the environment. I was considering following @rbachman's strategy of scripting around the issue. It would be great if you could post your solution.

Is it the case that the ENI is _never_ cleaned up, or just that it takes some time to go away?

I've noticed with other AWS services (RDS, ELB, etc) that the ENIs do get cleaned up when they are deleted but it can sometimes take some additional seconds for this to happen. Lambda leaving ENIs around indefinitely would be very unfortunate even ignoring Terraform's problems, because it could eventually lead to the VPC IP address space being depleted.

All of this is to say... could we hide this problem by implementing retries on the destruction of aws_security_group when it gets the error DependencyViolation? Of course we wouldn't want to keep retrying forever because there might legitimately be a network interface still active in the security group, but we could retry for maybe 15 seconds before giving up if Lambda does tend to free these after a short period.

Is it the case that the ENI is never cleaned up, or just that it takes some time to go away?

It may take time and it may not be just seconds.

I've noticed with other AWS services (RDS, ELB, etc) that the ENIs do get cleaned up when they are deleted but it can sometimes take some additional seconds for this to happen

AFAIK ENIs for Lambda are managed slightly differently to RDS & ELB.

All of this is to say... could we hide this problem by implementing retries on the destruction of aws_security_group when it gets the error DependencyViolation?

I was talking to AWS about this. I can't share the Lambda implementation details here and explain why this is happening (#nda 😢 ), but I can assure you that standard retry logic won't save us in this specific case, or at least won't cover as many cases as you'd hope.

The right patch/workaround is most likely going to involve catching DependencyViolation then searching for ENIs which have that specific SG ID and detaching that SG from the ENI.

I still need to verify whether the ENI is going to be cleaned up after it's modified or whether it's better to remove that ENI completely.

Also, deleting subnets is probably not something that people do on a daily basis, but they might bump into the same issue there - being unable to delete a subnet because it has unmanaged ENIs attached.

btw. I believe this is affecting CloudFormation too.

I've run into this exact error as well.

I'm thinking about ways to tackle this and wanted to get some thoughts from you @radeksimko. From my understanding, the ENI can potentially be shared between multiple Lambda functions that exist inside the VPC, so trying to clean these up inside the Lambda provisioner might not make sense. It seems to make more sense to clean them up in the VPC provisioner.

When Lambda creates an ENI, I've seen that it sets the description to something like 'AWS Lambda VPC ENI: XXX', where XXX is some GUID. I was thinking about adding some logic in the VPC provisioner to do the following on destroy:

If no lambda functions exist inside VPC:
    identify all lingering ENIs (using the description) and destroy them

This doesn't feel quite right, but honestly it's the best solution I can think of. Thoughts? Other ideas?
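The "no Lambda functions left in the VPC" check could be sketched as follows: a pure filter over function configurations plus an illustrative boto3 pager. The function names and the region default are assumptions, not provider code:

```python
def functions_in_vpc(functions, vpc_id):
    # Pure helper: which function configurations declare this VPC?
    return [f["FunctionName"] for f in functions
            if f.get("VpcConfig", {}).get("VpcId") == vpc_id]

def vpc_has_lambda_functions(vpc_id, region="us-east-1"):
    import boto3  # assumed available where this runs
    lam = boto3.client("lambda", region_name=region)
    # list_functions returns each function's VpcConfig (including VpcId)
    for page in lam.get_paginator("list_functions").paginate():
        if functions_in_vpc(page["Functions"], vpc_id):
            return True
    return False
```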

@gposton That's similar to a solution I was thinking about. The DependencyViolation is raised when deleting a security group, not a VPC.

For a _clean workaround_ (how crazy it sounds? 😄 ) I think we should do two things:

  • let aws_security_group Destroy() look up all ENIs that have the SG ID attached, loop through those and detach them all. Theoretically we could also remove any ENIs that only had a single SG ID attached, as it's clear that such an ENI won't be reused by any other Lambda function (otherwise it would be a serious security flaw).
  • let aws_subnet Destroy() look up all ENIs that look like the ones created by Lambda for the given subnet ID and detach + delete those.

Theoretically the first workaround should be sufficient for any Lambda functions that have at least 1 SG attached, the second one is for functions with no SGs.

i.e. I would try to keep the workaround as close to relevant code/resource as possible.

@radeksimko, sounds solid.. thx for the feedback. From my testing it looks like detaching an ENI doesn't remove the security group, so there's still a dependency violation. Point being we'll have to detach AND delete from both the security group and aws_subnet resources. Hopefully I'll have a PR soon.

From my testing it looks like detaching an ENI doesn't remove the security group

I would assume that's by design (i.e. expected and correct behaviour).
When I said "detaching" I rather meant detaching security group from the ENI, not detaching the ENI.

This is what I think needs to happen in aws_security_group.Destroy (pseudo code):

  1. ec2:DescribeNetworkInterfaces(filter = {group-id=d.Id(), Description="AWS Lambda VPC ENI:*" })
  2. Loop through given ENI ids and decide:
  3. Finally call ec2:DeleteSecurityGroup

I believe we should keep failing if the ENI was modified outside of Terraform and contains SGs that Terraform isn't aware of in that context, and we should not be detaching or deleting any ENIs that don't look like ENIs managed by Lambda.


Re deleted: we wouldn't actually fail in that context because we're deleting SG, not the ENI... so that's 👌
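In rough Python terms, the per-ENI decision could look like this. The actual provider change would be Go; the pure helper below just illustrates the modify-vs-delete logic, and the description filter leans on Lambda's observed naming convention rather than a documented contract:

```python
def plan_eni_action(eni, sg_id):
    """Decide what removing sg_id means for this ENI: shrink its group list
    if other SGs remain, or delete it outright (an ENI needs >= 1 SG)."""
    remaining = [g["GroupId"] for g in eni.get("Groups", []) if g["GroupId"] != sg_id]
    return ("modify", remaining) if remaining else ("delete", None)

def scrub_sg_from_lambda_enis(sg_id, region="us-east-1"):
    import boto3  # assumed available where this runs
    ec2 = boto3.client("ec2", region_name=region)
    enis = ec2.describe_network_interfaces(Filters=[
        {"Name": "group-id", "Values": [sg_id]},
        # observed Lambda naming convention, not a documented contract
        {"Name": "description", "Values": ["AWS Lambda VPC ENI*"]},
    ])["NetworkInterfaces"]
    for eni in enis:
        action, remaining = plan_eni_action(eni, sg_id)
        if action == "modify":
            ec2.modify_network_interface_attribute(
                NetworkInterfaceId=eni["NetworkInterfaceId"], Groups=remaining)
        else:
            att = eni.get("Attachment")
            if att:
                ec2.detach_network_interface(AttachmentId=att["AttachmentId"], Force=True)
            ec2.delete_network_interface(NetworkInterfaceId=eni["NetworkInterfaceId"])
```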

@radeksimko can you take a look at my PR: https://github.com/hashicorp/terraform/pull/8033

Note: This only addresses the cleanup of ENIs created by lambda functions when a security group is associated w/ the function.

My 2 cents here: I had the same issue. I am using Lambda inside a VPC with an SG (basically my Lambda function needs to make some calls to an internal ELB), so what I have done - and so far it has worked fine - is to update my Lambda IAM inline policy to:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateNetworkInterface",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DetachNetworkInterface",
                "ec2:ModifyNetworkInterfaceAttribute",
                "ec2:DeleteNetworkInterface"
            ],
            "Resource": "*"
        }
    ]
}

I have destroyed my test environment 3 times now and the problem did not arise again. Can someone test it and let me know if it checks out for you?

Hmm. @mgagliardo did you execute your lambda functions (creating the ENI) before running the delete? I duplicated your policy, but still hit the issue upon running destroy.

Hi @anoldguy, my Lambda function is called every time an instance from an ASG is terminated (ASG -> SNS -> Lambda), and as my environment is very volatile right now, the Lambda function is being called several times.

I have tried it again and saw it fail, so the only explanation I have is that after a while the Lambda service actually removed some of the ENIs itself (going back to t=0 for Lambda), and that is when I can remove the ENIs.

Sorry for the false expectation :(

I ended up here after being redirected from this thread => https://forums.aws.amazon.com/thread.jspa?messageID=756642

It turns out (in my case) that ENI resources that are not used by lambda get automatically cleaned up (so long as the lambda has ec2:DeleteNetworkInterface perms).

But the main caveat is that IT DOESN'T HAPPEN RIGHT AWAY. In my case it took a few hours for the ENIs to disappear.

I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
