Up until the end of last week (i.e. until Sep 6th), our code worked like a charm for creating and deleting aws_lambda_function resources, all based in the eu-west-1 region with a 3-zone (a, b, c) VPC and an associated security group.
Starting today, the deletion fails with a timeout while deleting the security group (DependencyViolation). We can see that the 3 network interfaces created for the lambda in the 3 zones are not deleted and are still 'in-use', which prevents the security group from being deleted. After a while (~15 min) they go into the 'available' state, at which point we can delete them manually, which also allows terraform destroy to be re-run successfully.
Looking at the logs in debug mode does not reveal anything worth mentioning here.
This is reproducible 100% of the time and our code base has not changed for several weeks.
Is anyone else having the same kind of issue? Any advice?
@obourdon This could be related to the recently announced improved VPC networking for AWS Lambda functions. Lambdas now share a pool of ENIs (where appropriate) so the idea that the lambda effectively owns the associated ENI is invalid.
There were no AWS API changes for this, AWS are rolling out globally over the next couple of months.
Reading that AWS announcement it's not clear to me when the ENI(s) associated with a _Subnet_, _Security Group_ pair get deleted.
From that blog post:
If Lambda functions in an account go idle for consecutive weeks, the service will reclaim the unused Hyperplane resources
but no mention of deleted lambdas.
Same issue here
+1 same issue for me
deleteLingeringLambdaENIs() was added in https://github.com/hashicorp/terraform/pull/8033 and https://github.com/hashicorp/terraform/pull/8486.
@ohuez @luthor2016ad Which AWS region(s) are you working in?
@ewbankkit many thanks for this info. I managed to get deeper into Terraform traces and also used CloudTrail to try to figure out what is wrong.
What I was able to diagnose properly so far right now is that I have an ELBV2 resource (Application Load Balancer) that gets removed and I see:
DetachNetworkInterface Client.AuthFailure You do not have permission to access the specified resource.
on the 3 interfaces attached to it (eni-attach-xxxx) within the CloudTrail entries.
Again, nothing changed on our side (code, permissions, roles, ...) since this started happening (I triple-checked this again today), so this is either on the AWS side or in the Terraform provider for AWS.
I also took some AWS cli JSON output before deletion for network interfaces, elbs, asgs, ... so that I can sync what I see in CloudTrail output, Terraform TRACE logs and others.
As for the other errors due to lambdas, I am still analyzing further before adding more comments.
BTW, I forgot to mention that we are running the latest available version of the Terraform AWS provider (2.27.0) and Terraform 0.11.14, and again this was working with earlier versions some time back (2.11.0 and 0.11.1 in late May 2019).
@ewbankkit We are working in eu-west-1.
Like @obourdon, nothing changed on the Terraform side (neither resource definitions nor versions used).
We are also using the latest aws provider version (2.27.0).
I get the same error when running the lambda VPC acceptance tests in eu-west-1:
$ AWS_DEFAULT_REGION=eu-west-1 make testacc TEST=./aws TESTARGS='-run=TestAccAWSLambdaFunction_VPC'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./aws -v -parallel 20 -run=TestAccAWSLambdaFunction_VPC -timeout 120m
=== RUN TestAccAWSLambdaFunction_VPC
=== PAUSE TestAccAWSLambdaFunction_VPC
=== RUN TestAccAWSLambdaFunction_VPCRemoval
=== PAUSE TestAccAWSLambdaFunction_VPCRemoval
=== RUN TestAccAWSLambdaFunction_VPCUpdate
=== PAUSE TestAccAWSLambdaFunction_VPCUpdate
=== RUN TestAccAWSLambdaFunction_VPC_withInvocation
=== PAUSE TestAccAWSLambdaFunction_VPC_withInvocation
=== CONT TestAccAWSLambdaFunction_VPC
=== CONT TestAccAWSLambdaFunction_VPCUpdate
=== CONT TestAccAWSLambdaFunction_VPC_withInvocation
=== CONT TestAccAWSLambdaFunction_VPCRemoval
--- FAIL: TestAccAWSLambdaFunction_VPC (1257.19s)
testing.go:630: Error destroying resource! WARNING: Dangling resources
may exist. The full state and error is shown below.
Error: errors during apply: 2 problems:
- Error deleting subnet: timeout while waiting for state to become 'destroyed' (last state: 'pending', timeout: 20m0s)
- Error deleting security group: DependencyViolation: resource sg-0fc6365b91ab5362f has a dependent object
status code: 400, request id: 0225bc3a-d43d-4a99-bee0-e2e0065f06dd
FAIL
FAIL github.com/terraform-providers/terraform-provider-aws/aws 1295.431s
make: *** [testacc] Error 1
The same tests run fine in the default acceptance test region (us-west-2):
$ make testacc TEST=./aws TESTARGS='-run=TestAccAWSLambdaFunction_VPC'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./aws -v -parallel 20 -run=TestAccAWSLambdaFunction_VPC -timeout 120m
=== RUN TestAccAWSLambdaFunction_VPC
=== PAUSE TestAccAWSLambdaFunction_VPC
=== RUN TestAccAWSLambdaFunction_VPCRemoval
=== PAUSE TestAccAWSLambdaFunction_VPCRemoval
=== RUN TestAccAWSLambdaFunction_VPCUpdate
=== PAUSE TestAccAWSLambdaFunction_VPCUpdate
=== RUN TestAccAWSLambdaFunction_VPC_withInvocation
=== PAUSE TestAccAWSLambdaFunction_VPC_withInvocation
=== CONT TestAccAWSLambdaFunction_VPC
=== CONT TestAccAWSLambdaFunction_VPC_withInvocation
=== CONT TestAccAWSLambdaFunction_VPCUpdate
=== CONT TestAccAWSLambdaFunction_VPCRemoval
--- PASS: TestAccAWSLambdaFunction_VPC (54.10s)
--- PASS: TestAccAWSLambdaFunction_VPCRemoval (75.89s)
--- PASS: TestAccAWSLambdaFunction_VPCUpdate (79.97s)
--- PASS: TestAccAWSLambdaFunction_VPC_withInvocation (87.53s)
PASS
ok github.com/terraform-providers/terraform-provider-aws/aws 87.608s
The ENI attachments in the first case are owned by amazon-aws, whereas in the second case they are owned by aws-lambda.
I can't manually delete the security groups and subnets created during the acceptance tests as they are in use by those same ENIs:
@ewbankkit thanks a lot for reproducing this. In the meantime I may have found the cause of the issue and I am currently testing a trial fix in my environment. Should have the result pretty soon so please stay tuned
Part of the issue is contained in this line of code in the AWS provider
The descriptions of the attached network interfaces are:
AWS Lambda VPC ENI-<LAMBDA-FN-NAME>-<UUID>
which obviously do not match the pattern
AWS Lambda VPC ENI: *
Fixing this and using
AWS Lambda VPC ENI*
in my test AWS provider does let me get a bit further, as I now see traces of deletion attempts on the interfaces. However, this is not sufficient because the interfaces were created using the role (and 'fake' user) attached to the lambda, whereas the deletion in this part of the code is done via my own terraform user, hence the error messages:
2019-09-11T15:25:52.168Z [DEBUG] plugin.terraform-provider-aws_v2.27.0_x4: 2019/09/11 15:25:52 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/DetachNetworkInterface Details:
2019-09-11T15:25:52.168Z [DEBUG] plugin.terraform-provider-aws_v2.27.0_x4: ---[ RESPONSE ]--------------------------------------
2019-09-11T15:25:52.168Z [DEBUG] plugin.terraform-provider-aws_v2.27.0_x4: HTTP/1.1 400 Bad Request
2019-09-11T15:25:52.168Z [DEBUG] plugin.terraform-provider-aws_v2.27.0_x4: Connection: close
2019-09-11T15:25:52.168Z [DEBUG] plugin.terraform-provider-aws_v2.27.0_x4: Transfer-Encoding: chunked
2019-09-11T15:25:52.168Z [DEBUG] plugin.terraform-provider-aws_v2.27.0_x4: Date: Wed, 11 Sep 2019 15:25:51 GMT
2019-09-11T15:25:52.168Z [DEBUG] plugin.terraform-provider-aws_v2.27.0_x4: Server: AmazonEC2
2019-09-11T15:25:52.168Z [DEBUG] plugin.terraform-provider-aws_v2.27.0_x4:
2019-09-11T15:25:52.168Z [DEBUG] plugin.terraform-provider-aws_v2.27.0_x4:
2019-09-11T15:25:52.168Z [DEBUG] plugin.terraform-provider-aws_v2.27.0_x4: -----------------------------------------------------
2019-09-11T15:25:52.168Z [DEBUG] plugin.terraform-provider-aws_v2.27.0_x4: 2019/09/11 15:25:52 [DEBUG] [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?>
2019-09-11T15:25:52.168Z [DEBUG] plugin.terraform-provider-aws_v2.27.0_x4: <Response><Errors><Error><Code>OperationNotPermitted</Code><Message>You are not allowed to manage 'ela-attach' attachments.</Message><\
/Error></Errors><RequestID>3ff82d8b-87ce-4717-b8c0-bddb6d5aedb1</RequestID></Response>
2019-09-11T15:25:52.169Z [DEBUG] plugin.terraform-provider-aws_v2.27.0_x4: 2019/09/11 15:25:52 [DEBUG] [aws-sdk-go] DEBUG: Validate Response ec2/DetachNetworkInterface failed, attempt 0/25, error OperationNotP\
ermitted: You are not allowed to manage 'ela-attach' attachments.
2019-09-11T15:25:52.169Z [DEBUG] plugin.terraform-provider-aws_v2.27.0_x4: status code: 400, request id: 3ff82d8b-87ce-4717-b8c0-bddb6d5aedb1
As far as my current provider knowledge is concerned, I have no clue on how to get further on this.
I created an AWS developer forum post: https://forums.aws.amazon.com/thread.jspa?messageID=915634.
@obourdon Can you open an AWS Technical Support case for this?
@ewbankkit will try to.
How can I remove the question label on this github issue entry and replace it with a bug/issue one instead ?
@ewbankkit I have updated my AWS Tech Support case with latest information from this issue
@bflad thanks for retagging this
We have been experiencing the same issue since Monday, September 9th.
@plum117 @luthor2016ad could you please click the thumbs up (+1) at the bottom of the top comment. Many thanks in advance
@ewbankkit how do you clean up the AWS resources left over by the provider acceptance tests after a failure? It seems I cannot delete any of the remaining resources in the AWS UI.
Here is the AWS answer to the support case:
"to answer your concern, the issue that you're seeing might be a result of the new improved VPC networking for AWS Lambda functions that is being rolled in the regions where the ENIs that are created by AWS Lambda in your VPC get orphaned if the IAM role gets deleted, since AWS Lambda owns the ENI but does not have access to IAM role to delete/detach ENI. AWS Lambda creates these ENIs using the IAM role associated with the lambda function. These ENIs are owned by AWS Lambda in your account but are managed from the IAM initially provided. If the IAM role gets deleted, then lambda fails to delete/detach the ENI it originally created."
For the resources left over by the provider acceptance tests, I am afraid we fall into the following case:
"To fix this issue we'll have to reach out to our internal team with the specific resources that you're seeing this issue with and then release the resources so that you can delete them manually. That being said, if you're still having issues while deleting any ENIs then please provide us the IDs so that we can open an internal request with our teams to look into them further. " :-(
@obourdon The standard way to clean up after acceptance testing failures is via test sweepers but be aware that they are very indiscriminate about destroying resources so please only run in an AWS account that is dedicated to testing.
For example:
$ go test ./aws -v -sweep=eu-west-1 -sweep-run=aws_lambda_function,aws_security_group
But in this particular case I don't think there's anything that can be done except wait for AWS to reclaim the ENIs after the indeterminate time period and then clean up manually.
Given what we know, I think what we can do is:
- Update the aws_lambda_function documentation to describe this situation
- Change deleteLingeringLambdaENIs to only attempt to detach and delete lambda ENIs that have an attachment owner of aws-lambda
- Change the TestAccAWSLambdaFunction_VPC acceptance tests to use the testing region's default VPC, subnets and security group (if possible) so that no attempt is made to create (and then destroy) a VPC, subnets and security group
@ewbankkit thanks for the suggestions
On my side I have tried several things which did not work, including modifying the provider code to force ENI deletion when deleting the lambda rather than when deleting the SG. It did not work even though the ENI owner seems to be my AWS ID and the lambda role was still present. Therefore the selective deletion you are suggesting above will unfortunately not work either.
I have also tried to work around the issue by setting a delete timeout of 25 minutes on the security group (because, as I stated above, I noticed that after 20 minutes or so the interfaces changed from in-use to available status) and patching the provider code to take this into account, but I fell into some other provider/SDK pitfalls. In fact, right after we try to detach the interface and get an error with
isAWSErr(detachNetworkInterfaceErr, "OperationNotPermitted", "You are not allowed to manage 'ela-attach' attachments") {
asking for the status of this interface via conn.DescribeNetworkInterfaces(p), with the p parameter setting NetworkInterfaceIds to the single interface ID, returns a wrong state of available instead of in-use (maybe due to caching or similar). After waiting a bit more (~1m50s) the state comes back to the real in-use value, and of course after the 20m0s delay we also get the proper available.
If someone knows whether this timeout can somehow be reduced when creating the lambda, it would also greatly help. (It might be configurable, as the lingering interfaces I had when running the provider acceptance tests behaved differently [they lingered for over 24h].)
Any insight from AWS and/or provider experts is very welcome.
@obourdon You are correct, in further testing I notice that the lingering Hyperplane ENIs seem to change from in-use to available not long after the 20 minute delete timeout trips.
Let me see what happens if I change that timeout to, let's say, 30 minutes and explicitly wait for the available state for that amount of time.
It looks like we can use a ContinuousTargetOccurence value of, say, 10 to make sure that we get a consistent read on the ENI state.
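The idea of requiring several consecutive matching reads before trusting the ENI state can be sketched in plain Go. This is only an illustration of the principle, not the provider or SDK code: waitForStableState and the simulated read function are hypothetical names, and the real implementation would call DescribeNetworkInterfaces instead of reading from a slice.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForStableState polls read until it has returned target on
// `needed` consecutive calls, or until maxPolls reads have been made.
// Any non-target read resets the streak. It returns the number of reads used.
func waitForStableState(read func() (string, error), target string, needed, maxPolls int, interval time.Duration) (int, error) {
	streak := 0
	for i := 1; i <= maxPolls; i++ {
		state, err := read()
		if err != nil {
			return i, err
		}
		if state == target {
			streak++
			if streak >= needed {
				return i, nil
			}
		} else {
			streak = 0
		}
		time.Sleep(interval)
	}
	return maxPolls, errors.New("timed out waiting for a stable " + target + " state")
}

func main() {
	// Simulated reads (hypothetical data): a stale "available" first,
	// then "in-use", then the real transition to "available".
	reads := []string{"available", "in-use", "in-use", "available", "available", "available"}
	i := 0
	read := func() (string, error) {
		s := reads[i]
		if i < len(reads)-1 {
			i++
		}
		return s, nil
	}
	n, err := waitForStableState(read, "available", 3, 20, 0)
	fmt.Println(n, err) // prints "6 <nil>": the early stale read is not trusted
}
```

With a single required read, the stale first answer would have been accepted immediately; requiring a streak trades a few extra polls for protection against the inconsistent reads described above.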
Everyone, be aware that after doing a lot of testing on this, it seems I have hit another AWS limit:
1 error occurred:
* aws_lambda_function.db_manager: 1 error occurred:
* aws_lambda_function.db_manager: Error creating Lambda function: InvalidParameterValueException: Eni Limit Exceeded, current eni count: 100, limit: 100
status code: 400, request id: c7c5825d-6325-4abe-bcd2-fc741efa0cff
Not sure how long it will take to reset this, or how it is accounted for, because I do not have 100 ENIs currently defined in parallel.
@ewbankkit I have a working fix that I am currently checking for my own issue and will also try to pass the acceptance tests in the different regions to see if it does not break anything
Will post the PR once all lights are green
@ewbankkit & others
I have started to work on a fix in the following branch and seems to be working fine for my case.
However, running the provider acceptance tests in the eu-central-1 region, where Lambda exhibits the issue (like in eu-west-1), I ran into errors. The first one is that sometimes the ENI just disappears after some time:
--- FAIL: TestAccAWSLambdaFunction_VPCUpdate (1185.78s)
testing.go:630: Error destroying resource! WARNING: Dangling resources
may exist. The full state and error is shown below.
Error: errors during apply: 2 problems:
- Failed to delete Lambda ENIs: Error waiting for ENI (eni-0bbab8b33e70df022) to become detached: Improper number of interfaces returned: 0
- Failed to delete Lambda ENIs: Error waiting for ENI (eni-0f8c10a3fb8fb2265) to become detached: Improper number of interfaces returned: 0
...
--- FAIL: TestAccAWSLambdaFunction_VPC (1186.27s)
testing.go:630: Error destroying resource! WARNING: Dangling resources
may exist. The full state and error is shown below.
Error: errors during apply: Failed to delete Lambda ENIs: Error waiting for ENI (eni-0210f49ef013ee00b) to become detached: Improper number of interfaces returned: 0
Therefore I tried the code at this line (without the false && statement of course) but then I run into some other issues:
--- FAIL: TestAccAWSLambdaFunction_VPCUpdate (1173.52s)
testing.go:630: Error destroying resource! WARNING: Dangling resources
may exist. The full state and error is shown below.
Error: errors during apply: 3 problems:
- Failed to delete Lambda ENIs: InvalidParameterValue: Network interface 'eni-06c9788a85b0ab6bc' is currently in use.
status code: 400, request id: eb761c39-e2b1-495d-86f8-32e176a29e4d
- Failed to delete Lambda ENIs: InvalidParameterValue: Network interface 'eni-06c9788a85b0ab6bc' is currently in use.
status code: 400, request id: 5607a9a0-da97-46ea-87db-89d9a3f9ce6c
- Failed to delete Lambda ENIs: InvalidParameterValue: Network interface 'eni-06c9788a85b0ab6bc' is currently in use.
status code: 400, request id: ee5cdfd6-a2e7-493c-ad46-5cf34ff29c3b
...
--- FAIL: TestAccAWSLambdaFunction_VPCRemoval (1252.33s)
testing.go:630: Error destroying resource! WARNING: Dangling resources
may exist. The full state and error is shown below.
Error: errors during apply: 2 problems:
- Failed to delete Lambda ENIs: Error waiting for ENI (eni-06ae40dc99f67dcf3) to become detached: timeout while waiting for state to become 'true' (last state: 'false', timeout: 20m0s)
- Failed to delete Lambda ENIs: Error waiting for ENI (eni-06ae40dc99f67dcf3) to become detached: timeout while waiting for state to become 'true' (last state: 'false', timeout: 20m0s)
Any help greatly appreciated.
I may be hitting several problems at once, but it seems that the describe interface call I use in the code mentioned above does not (always) return consistent results. Retrying the code in the same eu-central-1 region now exhibits new errors :-(
--- FAIL: TestAccAWSLambdaFunction_VPCUpdate (57.52s)
testing.go:630: Error destroying resource! WARNING: Dangling resources
may exist. The full state and error is shown below.
Error: errors during apply: 4 problems:
- Failed to delete Lambda ENIs: InvalidParameterValue: Network interface 'eni-01c989935c3fc03d9' is currently in use.
status code: 400, request id: cfcb37c2-184a-4fc2-b8fb-7869a005b193
- Failed to delete Lambda ENIs: InvalidParameterValue: Network interface 'eni-01c989935c3fc03d9' is currently in use.
status code: 400, request id: cb630177-2cf9-4c07-8cb4-3847bc281ace
- Failed to delete Lambda ENIs: InvalidParameterValue: Network interface 'eni-0654afc17b54a471e' is currently in use.
status code: 400, request id: 2babf9cc-3c11-4c32-b811-38e6e9710919
- Failed to delete Lambda ENIs: InvalidParameterValue: Network interface 'eni-01c989935c3fc03d9' is currently in use.
status code: 400, request id: abe1e89a-ff93-45f2-a7b6-4ab1a529f452
The other tests I have made in eu-west-1 for the same issue seem to work, but I cannot launch the acceptance tests in this region because I have my work currently running there and I hit another AWS limitation:
--- FAIL: TestAccAWSLambdaFunction_VPC_withInvocation (17.42s)
testing.go:569: Step 0 error: errors during apply:
Error: Error creating VPC: VpcLimitExceeded: The maximum number of VPCs has been reached.
status code: 400, request id: 4d48d4b5-4230-41ad-ac01-c88dbcb343b4
on /var/folders/k0/35x3347j4s18t_h4z932f3nr0000gn/T/tf-test181935921/main.tf line 75:
(source code not available)
--- FAIL: TestAccAWSLambdaFunction_VPC (17.43s)
testing.go:569: Step 0 error: errors during apply:
Error: Error creating VPC: VpcLimitExceeded: The maximum number of VPCs has been reached.
status code: 400, request id: 7d780b85-b675-48aa-9958-cd2195f66e83
on /var/folders/k0/35x3347j4s18t_h4z932f3nr0000gn/T/tf-test174044968/main.tf line 75:
(source code not available)
--- FAIL: TestAccAWSLambdaFunction_VPCUpdate (17.43s)
testing.go:569: Step 0 error: errors during apply:
Error: Error creating VPC: VpcLimitExceeded: The maximum number of VPCs has been reached.
status code: 400, request id: 52059c45-3a02-4ff9-bd05-e215275cc7af
on /var/folders/k0/35x3347j4s18t_h4z932f3nr0000gn/T/tf-test490668698/main.tf line 75:
(source code not available)
--- FAIL: TestAccAWSLambdaFunction_VPCRemoval (17.52s)
testing.go:569: Step 0 error: errors during apply:
Error: Error creating VPC: VpcLimitExceeded: The maximum number of VPCs has been reached.
status code: 400, request id: 0214b61c-32a1-4374-b1b1-c9720a5d10b2
on /var/folders/k0/35x3347j4s18t_h4z932f3nr0000gn/T/tf-test914715239/main.tf line 75:
(source code not available)
FAIL
FAIL github.com/terraform-providers/terraform-provider-aws/aws 17.576s
@obourdon To get around all those account limit errors you will have to manually delete the ENIs that are now in available state and then manually delete the VPCs that were created for the acceptance tests.
I have created a _WIP_ PR, https://github.com/terraform-providers/terraform-provider-aws/pull/10114, that waits, up to the security group/subnet delete timeout, for these Hyperplane ENIs to change from in-use to available, and I changed that delete timeout to 30 minutes as the ENIs take 20 minutes to make that state transition. However, because of some issue (maybe related to the fact that both these resource types have state migration functions) this new timeout is not being picked up and the default of 20 minutes is being used, still causing the acceptance test to fail with a timeout.
I'm also unable to override that default if I add a
timeouts {
  delete = "30m"
}
block to all the security group and subnet resources in the lambda acceptance tests.
@obourdon @ewbankkit FYI I escalated from my side to AWS support and got this feedback:
From a couple of tests ran by team it could be seen that:
- If You delete the function but not the role, the ENIs go to available status (after ~20mn) and are cleaned up automatically shortly after.
- If You delete the function AND the role, the ENIs go to available status (after ~20mn) but are not cleaned up. You will be then able to delete them manually.
If Terraform is cleaning up the role before the ENIs are deleted they will never be cleaned up. This will need to be fixed on Terraform side. You could wait for the ENI to be cleaned up automatically after the function is deleted and then delete the role, or wait for the ENI to be in available status and delete it before deleting the security group.
So, for now you will need to delete the available ENIs manually to be able to delete the security group.
Hope that helps in providing a fix.
@JulienChampseix Thanks for chasing this up.
It seems like the approach of increasing the subnet and security group delete timeout to 30 minutes, waiting for the ENI to come into available state and then detaching the ENI before deletion (of course handling the already detached case) makes sense.
@ewbankkit @JulienChampseix this definitely makes sense. However, I still have a hard time understanding this "new" (and maybe not so useful) 20-minute timeout mechanism, and whether it can be customised when creating the resources using terraform.
@obourdon from my own research and understanding of what AWS has done recently, AWS changed the Lambda networking inside a VPC.
What has not changed:
What has changed:
Lambda service now creates a NAT between the Lambda VPC and your VPC using AWS Hyperplane:
the Network Function Virtualization platform used for Network Load Balancer and NAT Gateway, has supported inter-VPC connectivity for offerings like AWS PrivateLink, and we are now leveraging Hyperplane to provide NAT capabilities from the Lambda VPC to customer VPCs.
Network interfaces in your VPC are mapped to the Hyperplane ENI and the functions connect using it. As a result, ENIs are not multiplied anymore as the execution environment scales;
From this announcement from AWS
Is it so useful?
Because the network interfaces are shared across execution environments, typically only a handful of network interfaces are required per function. Every unique security group:subnet combination across functions in your account requires a distinct network interface. If a combination is shared across multiple functions in your account, we reuse the same network interface across functions.
This avoids the proliferation of ENIs from Lambdas that could deplete the available IPs in your subnet (see https://github.com/hashicorp/terraform/issues/5767), and it skips the creation of a new ENI for each additional Lambda instance with the same security group:subnet combination.
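To make the sharing rule quoted above concrete, here is a small Go sketch that counts the network interfaces implied by a set of functions. It is a simplified model only: vpcAttachment and countSharedENIs are hypothetical names, the IDs are made up, and each record is assumed to carry exactly one security group:subnet pair as the announcement describes it.

```go
package main

import "fmt"

// vpcAttachment is a hypothetical record: one security group:subnet
// pair used by one Lambda function.
type vpcAttachment struct {
	Function      string
	SecurityGroup string
	Subnet        string
}

// countSharedENIs returns the number of distinct security group:subnet
// pairs, i.e. the number of shared network interfaces in this model.
func countSharedENIs(atts []vpcAttachment) int {
	type pair struct{ sg, subnet string }
	seen := make(map[pair]struct{})
	for _, a := range atts {
		seen[pair{a.SecurityGroup, a.Subnet}] = struct{}{}
	}
	return len(seen)
}

func main() {
	var atts []vpcAttachment
	// Two functions share the same security group in three subnets.
	for _, fn := range []string{"fn-a", "fn-b"} {
		for _, subnet := range []string{"subnet-a", "subnet-b", "subnet-c"} {
			atts = append(atts, vpcAttachment{fn, "sg-1", subnet})
		}
	}
	// A third function uses a different security group in one subnet.
	atts = append(atts, vpcAttachment{"fn-c", "sg-2", "subnet-a"})

	fmt.Println(countSharedENIs(atts)) // prints 4: three shared pairs plus one new pair
}
```

The same model also hints at the destroy-time problem: an interface for a pair that another function still uses cannot be expected to go away when only one of the functions is deleted.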
A side effect is that you aren't allowed to detach your network interface as easily anymore... I guess the NAT created by AWS Hyperplane plays a role in this impossibility to detach the interface.
The only solution I found for now is to poll the network interface Attachment status every 5 minutes. Once this status is 'available', it should be possible to remove the interfaces and delete the associated security group.
@antoninilouis many thanks for this very detailed and useful dig up
@ewbankkit I tried to find a way to use terraform 0.11.x for running acceptance tests, but the only way I have found so far is to modify go.mod, and it does not work. Whatever version I put there, it is overwritten with 0.12.6.
I tried unsuccessfully to find an alternative way but found nothing worthwhile. Could you please give me some advice? Many thanks in advance
Seems like I found the proper way to do it. Thanks anyway
Spoke too soon. I found some commands in the git log of go.mod and I can build my provider against terraform 0.11.x, but not for acceptance testing ...
Hi Folks - apologies for the delayed response here. I've been following the thread and looking into getting out a fix for the upstream changes. @obourdon you mentioned a fix in the works; if you want to open a PR with what you have we might be able to collaborate on it. I'm currently looking into confirming the timeout issues mentioned above. I will update this thread shortly with more details.
Thanks to everyone for your help in triaging this issue and working towards a fix.
@nywilken many thanks for looking into this. On my side I have posted my current changes in a branch of my own but so far I still have some issues that I am trying to solve.
I have also tried the WIP PR provider by @ewbankkit but again there are some (other) remaining issues.
Will post progress also on my side as soon as possible
@obourdon Please feel free to cherry pick my commit in the WIP PR and adapt, if it helps.
@ewbankkit this is what I have done already, currently trying to mix both for the final solution
Thanks for the inspiring work
@nywilken @ewbankkit one issue I am still having seems to be due to a strange "timing" in one of the acceptance tests.
In fact, I have noticed that there is a destroy phase which is called while one of the underlying ENI was about to move from available to in-use. I am currently trying to figure out which test is involved and if this can occur in the other regions where lambda mechanism has not been changed
@obourdon @ewbankkit thanks again for the help with this issue. After testing and looking at the changes on @obourdon's branch I see three issues, two of which pertain to the inability to detach AWS managed ENIs.
Please note that not all solutions listed below have been tested (mainly solution for issue 3). I am sharing so that we can level set on what we are seeing the issues being and possible ways forward.
Feel free to call out any gaps in my representation of the issues or gaps in my thinking.
Issue 1: The Describe Network Interfaces filter requires a slight change to the filter value in order to identify the attached ENIs. Failure to do so results in a failure to delete the security group and subnets due to dependency errors.
Solution:
Updating the filter string solves this problem.
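A small Go sketch of why the old filter value misses the new ENI descriptions. It only models a trailing-style `*` wildcard locally with a regular expression; matchesFilter is a hypothetical helper, the sample descriptions are made up to follow the shapes quoted earlier in this thread, and the real matching is done server-side by EC2.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matchesFilter reports whether value matches a filter pattern in which
// '*' stands for any run of characters. All other characters are literal.
func matchesFilter(pattern, value string) bool {
	quoted := regexp.QuoteMeta(pattern)
	// QuoteMeta turns '*' into `\*`; map it back to "any characters".
	quoted = strings.ReplaceAll(quoted, `\*`, `.*`)
	return regexp.MustCompile(`(?s)^` + quoted + `$`).MatchString(value)
}

func main() {
	oldPattern := "AWS Lambda VPC ENI: *"
	newPattern := "AWS Lambda VPC ENI*"

	// Hypothetical descriptions in the old and the new format.
	oldDesc := "AWS Lambda VPC ENI: 1234abcd"
	newDesc := "AWS Lambda VPC ENI-my-function-1234abcd"

	fmt.Println(matchesFilter(oldPattern, oldDesc)) // true
	fmt.Println(matchesFilter(oldPattern, newDesc)) // false: no ": " in the new format
	fmt.Println(matchesFilter(newPattern, oldDesc)) // true
	fmt.Println(matchesFilter(newPattern, newDesc)) // true
}
```

The relaxed pattern matches both formats, which matters while the rollout is only partial across regions.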
Issue 2: Unable to detach AWS managed ENIs. After updating the filter description value, the identified ENIs throw an error when calling DetachNetworkInterface. AWS will automatically detach the ENI after a set time interval, which appears to be over the 20 minute mark.
Once the ENI is put into the available state the Security Group and associated subnet are safe to delete.
Possible solutions:
1. Increase the Delete Timeout to 30m and add logic in deleteLingeringLambdaENIs to wait for the attached ENI(s) to go into the available state (or be automatically detached from the security group) before returning, to allow for the successful deletion of the security group and the subnet.
* Adds an additional wait time to the Terraform destroy run (20+ minutes)
2. Update deleteLingeringLambdaENIs to continue onto the next ENI if the one currently visited is AWS managed (AttachmentId begins with ela-attach). This will cause the security group deletion to retry on DependencyViolation errors.
Open question: in the case where one lambda function is associated with multiple security groups/subnets, will all ENIs become available at the same time?
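The second possible solution above, skipping ENIs whose attachment is managed by AWS, could look roughly like the following Go sketch. It is not the provider code: the eni type and planENICleanup are hypothetical, the IDs are invented, and the only fact taken from this thread is that the managed attachments have an AttachmentId beginning with ela-attach.

```go
package main

import (
	"fmt"
	"strings"
)

// eni is a hypothetical, trimmed-down view of a network interface.
type eni struct {
	ID           string
	AttachmentID string // empty when the interface is already detached
}

// planENICleanup splits ENIs into those to detach and then delete,
// those that can be deleted directly, and those that are skipped
// because their attachment is AWS managed ("ela-attach" prefix).
func planENICleanup(enis []eni) (detach, deleteOnly, skipped []string) {
	for _, e := range enis {
		switch {
		case e.AttachmentID == "":
			deleteOnly = append(deleteOnly, e.ID)
		case strings.HasPrefix(e.AttachmentID, "ela-attach"):
			skipped = append(skipped, e.ID)
		default:
			detach = append(detach, e.ID)
		}
	}
	return detach, deleteOnly, skipped
}

func main() {
	enis := []eni{
		{ID: "eni-aaa", AttachmentID: "ela-attach-111"},
		{ID: "eni-bbb", AttachmentID: "eni-attach-222"},
		{ID: "eni-ccc"},
	}
	detach, deleteOnly, skipped := planENICleanup(enis)
	fmt.Println("detach then delete:", detach)
	fmt.Println("delete directly:", deleteOnly)
	fmt.Println("skipped (AWS managed):", skipped)
}
```

Skipped interfaces are left for AWS to release, so the caller still has to retry the security group deletion until the DependencyViolation clears or the timeout is reached.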
Issue 3: Shared ENIs will not become available. More specifically, if two lambda functions are using the same security group/subnet combination, then the attached ENI may not become available if the two lambda functions are being managed by separate configurations.
Possible Solution:
Catch the ENI still in-use timeout error and use that as an indicator that the ENI is used by some other lambda function. Skip the security group and subnet deletion calls and return no errors in the respective delete functions.
@nywilken many thanks for this very detailed information. Please note also that as stated by @ewbankkit in the following comment and associated issue modifying delete timeouts does not work.
However I am experiencing some other weird cases like I mentioned here. Any insights on this ?
Furthermore, I ran another test yesterday evening (~6PM CEST) (aws provider acceptance test suite for lambdas) with a lot of traces added in the provider code and seems like one of the created interface is still in use with no lambda associated to it.
BTW where does the 20+ minutes timeout come from and is there a way to modify this at lambda creation time ? Does it make sense to wait that long for a resource to be lingering after resources using it being destroyed ?
To be more specific, most of these issues are happening in the eu-west-1 region and do not seem to be 'consistent' across failing regions. eu-central-1, for instance, does not exhibit the same behaviour even though it is also one of the failing regions where the AWS lambda mechanism seems to have been upgraded.
@ewbankkit is it possible (and how) to run acceptance tests with terraform 0.11.x and not 0.12.x ?
@obourdon There were no AWS API changes made for the _improved VPC networking_, so no way to change that 20 minutes which is an Amazon-decided value.
It looks like the issue with changing the Terraform resource-level timeout to 30 minutes successfully in the acceptance tests is being addressed by https://github.com/hashicorp/terraform/pull/22837.
I will try to pull in the commit from that PR into a private build of my PR tomorrow and see what happens in us-west-2, eu-west-1 and eu-central-1.
Please note that integration of PR#10165 might also have an impact on final results (@ewbankkit)
@obourdon Good catch, thanks. I don't think that missing return affects any of the cases we were looking at but it may explain other weird errors.
@ewbankkit fully agree
BTW, did you see the last part of one of my latest posts?
Seems like I finally got very very close to a working fix for this issue: see my updated branch. Note that there are currently a lot of additional and personal traces which helped me figure out what could be wrong.
I have combined it with the missing return PR, a workaround for the delete timeout
It passed the acceptance tests in:
i=2 ; z=eu-central-1 ; for z in $z ; do (echo $z ; TF_LOG_PATH=../RES/$z/RES$i/olivier-traces-"$z".log TF_LOG=DEBUG AWS_DEFAULT_REGION=$z AWS_PROFILE=dev gmake testacc TEST=./aws TESTARGS='-run=TestAccAWSLambdaFunction_VPC') 2>&1 | ts | tee RES/$z/RES$i/res$i.log ; done
[2019-09-20 11:35:22] eu-central-1
[2019-09-20 11:35:22] ==> Checking that code complies with gofmt requirements...
[2019-09-20 11:35:27] TF_ACC=1 go test ./aws -v -parallel 20 -run=TestAccAWSLambdaFunction_VPC -timeout 120m
[2019-09-20 11:35:40] === RUN TestAccAWSLambdaFunction_VPC
[2019-09-20 11:35:40] === PAUSE TestAccAWSLambdaFunction_VPC
[2019-09-20 11:35:40] === RUN TestAccAWSLambdaFunction_VPCRemoval
[2019-09-20 11:35:40] === PAUSE TestAccAWSLambdaFunction_VPCRemoval
[2019-09-20 11:35:40] === RUN TestAccAWSLambdaFunction_VPCUpdate
[2019-09-20 11:35:40] === PAUSE TestAccAWSLambdaFunction_VPCUpdate
[2019-09-20 11:35:40] === RUN TestAccAWSLambdaFunction_VPC_withInvocation
[2019-09-20 11:35:40] === PAUSE TestAccAWSLambdaFunction_VPC_withInvocation
[2019-09-20 11:35:40] === CONT TestAccAWSLambdaFunction_VPC
[2019-09-20 11:35:40] === CONT TestAccAWSLambdaFunction_VPC_withInvocation
[2019-09-20 11:35:40] === CONT TestAccAWSLambdaFunction_VPCUpdate
[2019-09-20 11:35:40] === CONT TestAccAWSLambdaFunction_VPCRemoval
[2019-09-20 11:57:45] --- PASS: TestAccAWSLambdaFunction_VPC (1325.49s)
[2019-09-20 11:57:45] --- PASS: TestAccAWSLambdaFunction_VPCRemoval (1325.51s)
[2019-09-20 11:57:49] --- PASS: TestAccAWSLambdaFunction_VPCUpdate (1328.69s)
[2019-09-20 11:57:55] --- PASS: TestAccAWSLambdaFunction_VPC_withInvocation (1334.85s)
[2019-09-20 11:57:55] PASS
[2019-09-20 11:57:55] ok github.com/terraform-providers/terraform-provider-aws/aws 1334.910s
i=2 ; z=eu-west-1 ; for z in $z ; do (echo $z ; TF_LOG_PATH=../RES/$z/RES$i/olivier-traces-"$z".log TF_LOG=DEBUG AWS_DEFAULT_REGION=$z AWS_PROFILE=dev gmake testacc TEST=./aws TESTARGS='-run=TestAccAWSLambdaFunction_VPC') 2>&1 | ts | tee RES/$z/RES$i/res$i.log ; done
[2019-09-20 13:01:10] eu-west-1
[2019-09-20 13:01:10] ==> Checking that code complies with gofmt requirements...
[2019-09-20 13:01:14] TF_ACC=1 go test ./aws -v -parallel 20 -run=TestAccAWSLambdaFunction_VPC -timeout 120m
[2019-09-20 13:01:28] === RUN TestAccAWSLambdaFunction_VPC
[2019-09-20 13:01:28] === PAUSE TestAccAWSLambdaFunction_VPC
[2019-09-20 13:01:28] === RUN TestAccAWSLambdaFunction_VPCRemoval
[2019-09-20 13:01:28] === PAUSE TestAccAWSLambdaFunction_VPCRemoval
[2019-09-20 13:01:28] === RUN TestAccAWSLambdaFunction_VPCUpdate
[2019-09-20 13:01:28] === PAUSE TestAccAWSLambdaFunction_VPCUpdate
[2019-09-20 13:01:28] === RUN TestAccAWSLambdaFunction_VPC_withInvocation
[2019-09-20 13:01:28] === PAUSE TestAccAWSLambdaFunction_VPC_withInvocation
[2019-09-20 13:01:28] === CONT TestAccAWSLambdaFunction_VPC
[2019-09-20 13:01:28] === CONT TestAccAWSLambdaFunction_VPC_withInvocation
[2019-09-20 13:01:28] === CONT TestAccAWSLambdaFunction_VPCUpdate
[2019-09-20 13:01:28] === CONT TestAccAWSLambdaFunction_VPCRemoval
[2019-09-20 13:20:44] --- PASS: TestAccAWSLambdaFunction_VPC (1156.20s)
[2019-09-20 13:20:57] --- PASS: TestAccAWSLambdaFunction_VPC_withInvocation (1168.50s)
[2019-09-20 13:22:00] --- PASS: TestAccAWSLambdaFunction_VPCRemoval (1232.32s)
[2019-09-20 13:54:50] --- FAIL: TestAccAWSLambdaFunction_VPCUpdate (3202.02s)
[2019-09-20 13:54:50] testing.go:630: Error destroying resource! WARNING: Dangling resources
[2019-09-20 13:54:50] may exist. The full state and error is shown below.
[2019-09-20 13:54:50]
[2019-09-20 13:54:50] Error: errors during apply: 2 problems:
[2019-09-20 13:54:50]
[2019-09-20 13:54:50] - Error deleting subnet: timeout while waiting for state to become 'destroyed' (last state: 'pending', timeout: 34m0s)
[2019-09-20 13:54:50] - Error deleting security group: DependencyViolation: resource sg-0ca2f5e9f3f4edcde has a dependent object
[2019-09-20 13:54:50] status code: 400, request id: fd5b9c90-731a-4b4b-a885-c3caf75cdc87
[2019-09-20 13:54:50]
[2019-09-20 13:54:50] State: aws_security_group.sg_for_lambda:
[2019-09-20 13:54:50] ID = sg-0ca2f5e9f3f4edcde
[2019-09-20 13:54:50] provider = provider.aws
[2019-09-20 13:54:50] arn = arn:aws:ec2:eu-west-1:981467355511:security-group/sg-0ca2f5e9f3f4edcde
[2019-09-20 13:54:50] description = Allow all inbound traffic for lambda test
[2019-09-20 13:54:50] egress.# = 1
[2019-09-20 13:54:50] egress.482069346.cidr_blocks.# = 1
[2019-09-20 13:54:50] egress.482069346.cidr_blocks.0 = 0.0.0.0/0
[2019-09-20 13:54:50] egress.482069346.description =
[2019-09-20 13:54:50] egress.482069346.from_port = 0
[2019-09-20 13:54:50] egress.482069346.ipv6_cidr_blocks.# = 0
[2019-09-20 13:54:50] egress.482069346.prefix_list_ids.# = 0
[2019-09-20 13:54:50] egress.482069346.protocol = -1
[2019-09-20 13:54:50] egress.482069346.security_groups.# = 0
[2019-09-20 13:54:50] egress.482069346.self = false
[2019-09-20 13:54:50] egress.482069346.to_port = 0
[2019-09-20 13:54:50] ingress.# = 1
[2019-09-20 13:54:50] ingress.482069346.cidr_blocks.# = 1
[2019-09-20 13:54:50] ingress.482069346.cidr_blocks.0 = 0.0.0.0/0
[2019-09-20 13:54:50] ingress.482069346.description =
[2019-09-20 13:54:50] ingress.482069346.from_port = 0
[2019-09-20 13:54:50] ingress.482069346.ipv6_cidr_blocks.# = 0
[2019-09-20 13:54:50] ingress.482069346.prefix_list_ids.# = 0
[2019-09-20 13:54:50] ingress.482069346.protocol = -1
[2019-09-20 13:54:50] ingress.482069346.security_groups.# = 0
[2019-09-20 13:54:50] ingress.482069346.self = false
[2019-09-20 13:54:50] ingress.482069346.to_port = 0
[2019-09-20 13:54:50] name = tf_acc_sg_lambda_func_vpc_upd_aeq362mf
[2019-09-20 13:54:50] owner_id = 981467355511
[2019-09-20 13:54:50] revoke_rules_on_delete = false
[2019-09-20 13:54:50] tags.% = 0
[2019-09-20 13:54:50] vpc_id = vpc-0379e5aaa0b9d1375
[2019-09-20 13:54:50] aws_subnet.subnet_for_lambda:
[2019-09-20 13:54:50] ID = subnet-09c6f08256a1ae31f
[2019-09-20 13:54:50] provider = provider.aws
[2019-09-20 13:54:50] arn = arn:aws:ec2:eu-west-1:981467355511:subnet/subnet-09c6f08256a1ae31f
[2019-09-20 13:54:50] assign_ipv6_address_on_creation = false
[2019-09-20 13:54:50] availability_zone = eu-west-1c
[2019-09-20 13:54:50] availability_zone_id = euw1-az1
[2019-09-20 13:54:50] cidr_block = 10.0.1.0/24
[2019-09-20 13:54:50] ipv6_cidr_block =
[2019-09-20 13:54:50] ipv6_cidr_block_association_id =
[2019-09-20 13:54:50] map_public_ip_on_launch = false
[2019-09-20 13:54:50] owner_id = 981467355511
[2019-09-20 13:54:50] tags.% = 1
[2019-09-20 13:54:50] tags.Name = tf-acc-lambda-function-1
[2019-09-20 13:54:50] vpc_id = vpc-0379e5aaa0b9d1375
[2019-09-20 13:54:50] aws_vpc.vpc_for_lambda:
[2019-09-20 13:54:50] ID = vpc-0379e5aaa0b9d1375
[2019-09-20 13:54:50] provider = provider.aws
[2019-09-20 13:54:50] arn = arn:aws:ec2:eu-west-1:981467355511:vpc/vpc-0379e5aaa0b9d1375
[2019-09-20 13:54:50] assign_generated_ipv6_cidr_block = false
[2019-09-20 13:54:50] cidr_block = 10.0.0.0/16
[2019-09-20 13:54:50] default_network_acl_id = acl-04a8b60f0db2f2de8
[2019-09-20 13:54:50] default_route_table_id = rtb-0d085dcea83a43420
[2019-09-20 13:54:50] default_security_group_id = sg-03275b3ddce20b414
[2019-09-20 13:54:50] dhcp_options_id = dopt-a7f98cc1
[2019-09-20 13:54:50] enable_classiclink = false
[2019-09-20 13:54:50] enable_classiclink_dns_support = false
[2019-09-20 13:54:50] enable_dns_hostnames = false
[2019-09-20 13:54:50] enable_dns_support = true
[2019-09-20 13:54:50] instance_tenancy = default
[2019-09-20 13:54:50] ipv6_association_id =
[2019-09-20 13:54:50] ipv6_cidr_block =
[2019-09-20 13:54:50] main_route_table_id = rtb-0d085dcea83a43420
[2019-09-20 13:54:50] owner_id = 981467355511
[2019-09-20 13:54:50] tags.% = 1
[2019-09-20 13:54:50] tags.Name = terraform-testacc-lambda-function
[2019-09-20 13:54:50] FAIL
[2019-09-20 13:54:50] FAIL github.com/terraform-providers/terraform-provider-aws/aws 3202.081s
[2019-09-20 13:54:50] gmake: *** [GNUmakefile:20: testacc] Error 1
I also checked that there were no resources remaining on AWS after acceptance tests passed (network interfaces, security groups, VPCs, ...)
I am currently checking the collected logs to see what could be next ...
HTH
Hi folks 👋 We have merged in #10347 which was based off of #10114 and the excellent work done by @ewbankkit and @obourdon. This will release in version 2.31.0 of the Terraform AWS Provider, tomorrow.
We mitigate this issue by fixing the ENI description lookup and updating the Lambda ENI deletion logic to always wait a 45 minute grace period (based on Lambda service team analytics) for background processes in the Lambda infrastructure to detach Lambda Hyperplane ENIs.
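To picture the grace-period idea described above, here is a minimal Go sketch of a polling loop that waits for a network interface to leave the 'in-use' state. It is not the provider's actual code: `waitForDetach`, `eniState`, `errGraceExpired` and the fake lookup in `main` are hypothetical names introduced for illustration, and a real implementation would query the EC2 API for the interface status rather than call a stub.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// eniState is a hypothetical stand-in for the status of a network interface.
type eniState string

const (
	stateInUse     eniState = "in-use"
	stateAvailable eniState = "available"
)

// errGraceExpired is returned when the interface is still attached
// once the grace period is over.
var errGraceExpired = errors.New("ENI still in use after grace period")

// waitForDetach calls lookup every interval until it reports "available"
// or until another wait would exceed the grace period.
// It returns the number of lookups performed.
func waitForDetach(lookup func() (eniState, error), grace, interval time.Duration) (int, error) {
	deadline := time.Now().Add(grace)
	for polls := 1; ; polls++ {
		state, err := lookup()
		if err != nil {
			return polls, err
		}
		if state == stateAvailable {
			return polls, nil
		}
		// Give up if sleeping once more would take us past the deadline.
		if time.Now().Add(interval).After(deadline) {
			return polls, errGraceExpired
		}
		time.Sleep(interval)
	}
}

func main() {
	// Fake lookup (assumption): reports "in-use" twice, then "available".
	calls := 0
	fake := func() (eniState, error) {
		calls++
		if calls < 3 {
			return stateInUse, nil
		}
		return stateAvailable, nil
	}

	// 45 minutes mirrors the grace period mentioned above; the millisecond
	// polling interval is only there to keep the demo fast.
	polls, err := waitForDetach(fake, 45*time.Minute, time.Millisecond)
	fmt.Println(polls, err)
}
```

With the fake lookup the loop returns on the third poll. Against a real account the polling interval would be much longer, and only once the interface reports 'available' could it be deleted, followed by the security group and subnet that depend on it.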
All Terraform AWS Provider environments using Lambda functions with VPC configurations should strongly consider updating to version 2.31.0 or higher as the Lambda service changes are planned to continue rolling out to all AWS regions and accounts in the coming weeks. For environments that cannot upgrade yet, there is now a followup issue, https://github.com/terraform-providers/terraform-provider-aws/issues/10329, which highlights some Terraform configuration changes that can help mitigate the issue in older Terraform AWS Provider versions. That issue can also be followed for future updates about deletion time reductions for the new Lambda networking.
This has been released in version 2.31.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.
For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!
Just a small comment to confirm that 2.31.0 completely fixes the issue.
Many thanks to all who worked on solving this.
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!