Aws-load-balancer-controller: ALB ingress controller is looking for the instances in an unknown VPC to place the target groups

Created on 11 Mar 2019 · 30 Comments · Source: kubernetes-sigs/aws-load-balancer-controller

After deploying an ingress resource for the ingress controller, the error below loops in the controller logs. I searched for the VPC in my account, but no such VPC exists.

Error adding targets to target group arn:aws:elasticloadbalancing:ap-southeast-1:*:targetgroup/a6106773-73ee60071b3e98f408a/d9a7**: InvalidTarget: The following targets are not in the target group VPC 'vpc-0dc2960bdb7e9': 'i-0e62a05e40c2', 'i-0e3e11eb9294', 'i-0b*718cdd87091

lifecycle/rotten

chsaiimannoj

👍6

All 30 comments

Hi,
Do you mean that vpc-0dc2960bdb7e9 is not your cluster's VPC?
By default, the ALB ingress controller infers the cluster's VPC by querying ec2metadata from the controller pod. Are you running any sidecar that intercepts the ec2metadata call (such as kube2iam)?
If so, you can manually specify the VPC ID and region via --aws-vpc-id=YourVPCID and --aws-region=YourClusterRegion.

M00nF1sh on 15 Mar 2019

I have the same issue when deploying a chart with an ALB ingress using Helm. It happens if I deploy my chart on a freshly created EKS cluster immediately after bringing the cluster up and installing the ALB ingress controller. If I wait about 10 minutes after installing the ALB ingress controller and then install my chart, the ingress is created successfully. The error message contains the correct instance IDs but the wrong VPC ID. The VPC ID it uses does not exist in any region in my AWS account, and across several tests on freshly created EKS clusters the VPC ID changes each time. I have no idea where it's getting those VPC IDs from. I don't have any other add-ons or sidecars that could be responsible for this. I just have these running:

kube-system alb-ingress-controller-55fdf469dc-wsdn2 1/1 Running 0 1h
kube-system aws-node-df7jv 1/1 Running 0 1h
kube-system aws-node-npbq6 1/1 Running 0 1h
kube-system aws-node-zd7w6 1/1 Running 0 1h
kube-system coredns-7d77776957-hn9r9 1/1 Running 0 1h
kube-system coredns-7d77776957-ts8vl 1/1 Running 0 1h
kube-system kube-proxy-2c8dr 1/1 Running 0 1h
kube-system kube-proxy-cc4rc 1/1 Running 0 1h
kube-system kube-proxy-vqjf9 1/1 Running 0 1h
kube-system tiller-deploy-85744d9bfb-jzwbg 1/1 Running 0 1h

campee on 27 Mar 2019

👍2

Here is an example of one such error from the ingress controller:

The instance IDs were correct, but the VPC was not, and it was not a VPC in any region within my account.

E0326 22:41:44.279073 1 targets.go:80] default/ui: status code: 400, request id: 54b2ece6-5018-11e9-b675-db068043e3e4
E0326 22:41:44.279176 1 :0] kubebuilder/controller "msg"="Reconciler error" "error"="failed to reconcile targetGroups due to failed to reconcile targetGroup targets due to InvalidTarget: The following targets are not in the target group VPC 'vpc-04e6ea299853b63eb': 'i-08a5793d40144befe', 'i-031442f34a892b4eb', 'i-0b4c0698d1be8726e'\n\tstatus code: 400, request id: 54b2ece6-5018-11e9-b675-db068043e3e4" "Controller"="alb-ingress-controller" "Request"={"Namespace":"default","Name":"ui"}

campee on 27 Mar 2019

I'm also noticing that it fails on the first ingress that I create but if I delete that ingress and then reapply the exact same manifest to recreate the ingress it works the second time.

campee on 28 Mar 2019

Just encountered the same issue on a newly created EKS cluster.

Versions:
Image version: docker.io/amazon/aws-alb-ingress-controller:v1.0.1
EKS Kubernetes version: 1.12

The load balancer (ALB) was deployed into the correct cluster VPC, but the Target Group was associated with a different VPC not found in any region in the account.

Logs were the same as campee reported. Recreating the ingress controller and redeploying charts that created ingress resources resolved the issue.

Edit: both --aws-vpc-id and --aws-region were specified as arguments in the ingress controller deployment.

huntermassey on 11 Apr 2019

@huntermassey
This is super weird... did you use a VPC-peered EKS cluster?
Could you share your AWS accountID and clusterName with me ([email protected])? And also the erroneous vpcID of the targetGroup (if you still have the logs).

M00nF1sh on 12 Apr 2019

I am also having this issue on a brand new EKS cluster.

E0413 01:11:21.259596 1 :0] kubebuilder/controller "msg"="Reconciler error" "error"="failed to reconcile targetGroups due to failed to reconcile targetGroup targets due to InvalidTarget: The following targets are not in the target group VPC 'vpc-03caf487fb0a0174d': 'i-0f8e658161b8af7e7', 'i-093bf9b038f794fc0', 'i-0914c426325445091'\n

I've flipped through all of my regions in AWS and I have the default AWS VPC and a VPC created by terraform (which is also creating the EKS cluster) and I cannot find VPC ID vpc-03caf487fb0a0174d anywhere.

[edit]
It's probably also worth noting that this is an unused AWS account. The only resources in the entire account are the items mentioned here, an S3 bucket, and some users/roles/policies.

adamwolfe-tc on 13 Apr 2019

I just encountered this bug.

alb-ingress-controller attempts to use an existing target group if your ingress spec doesn't change, even if your VPC does.

For example, if you were to build your entire stack in CloudFormation, then delete it, then recreate it, alb-ingress-controller uses the target groups from your first VPC.

Maybe implementing something to cleanup target groups when deleting your cluster/vpc/ingress or having alb-ingress-controller check your configured VPC ID before using a bad target group?

stevenoctopus on 15 Apr 2019

👍17

@stevenoctopus you just saved my day debugging this issue 🤣 ... Yeah, this should be what happened.
The unique IDs for AWS resources (LB/TG) are computed from clusterName, namespace, and ingressName/svcName, and do not include the vpcID.
Users should delete the ingresses before tearing down the cluster (the ALB ingress controller will then delete the AWS resources; we'll use a finalizer to ensure the AWS resources are deleted before the ingress objects are removed).

M00nF1sh on 15 Apr 2019
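
The collision described in the comment above can be illustrated with a minimal sketch. The `resource_id` helper, the SHA-256 hashing, and the cluster/ingress names are all hypothetical and are not the controller's actual naming algorithm; the only point taken from the thread is that the VPC ID is not part of the key.

```python
import hashlib


def resource_id(cluster_name: str, namespace: str, ingress_name: str, svc_name: str) -> str:
    """Hypothetical ID scheme: hash only the fields named in the comment above.

    Note that no VPC ID is passed in, so it cannot influence the result.
    """
    key = "/".join([cluster_name, namespace, ingress_name, svc_name])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


# First deployment of the stack (imagine it lives in an old VPC).
first = resource_id("demo-cluster", "default", "ui", "ui-svc")

# Stack deleted and recreated in a new VPC with the same names.
second = resource_id("demo-cluster", "default", "ui", "ui-svc")

# The IDs are identical, so a leftover target group from the old VPC
# would be matched and reused by the new deployment.
ids_collide = first == second
print(ids_collide)
```

Under this assumption, any leftover target group from the first deployment has the same ID as the one the new deployment expects, which matches the reuse behaviour reported in the thread.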

+1 - in my case we had just torn down a cluster+VPC before standing the whole thing up again. Looked at CloudTrail logs, the erroneous VPC for the target group was the one we had deleted.

So for those of us with automation to stand up/tear down clusters & VPCs with CloudFormation etc., we will need to delete all ingress resources prior to deleting the cluster to avoid this issue?

Any plans to factor in VPC ID in unique ID, or something else so this extra step isn't required?

huntermassey on 15 Apr 2019

@huntermassey Factoring the vpcID into the uniqueID wouldn't solve the root problem; it would only silently bypass it (you would have unused targetGroups dangling in your AWS account). Instead, we can validate the vpcID and provide a better error message for this case.

I think deleting all ingresses as part of the automation is the best way to go. (Once we add support for finalizers, deleting all namespaces should also work :D)

M00nF1sh on 15 Apr 2019

👍1
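
The validation idea in the comment above might look roughly like the sketch below. It is not the controller's code: `StaleTargetGroupError` and `check_target_group_vpc` are hypothetical names, and the input dict is assumed to have the `TargetGroupArn` and `VpcId` fields that the ELBv2 DescribeTargetGroups API returns for each target group.

```python
class StaleTargetGroupError(Exception):
    """Raised when an existing target group belongs to a different VPC."""


def check_target_group_vpc(target_group: dict, cluster_vpc_id: str) -> None:
    """Fail early with a descriptive message instead of a bare InvalidTarget."""
    tg_vpc = target_group.get("VpcId")
    if tg_vpc != cluster_vpc_id:
        raise StaleTargetGroupError(
            f"target group {target_group.get('TargetGroupArn')} is in VPC {tg_vpc}, "
            f"but the cluster is in VPC {cluster_vpc_id}; "
            "it may be left over from a previously deleted cluster"
        )


# Illustrative values only; these are not real ARNs or VPC IDs.
leftover = {"TargetGroupArn": "arn:example:targetgroup/demo", "VpcId": "vpc-old"}
try:
    check_target_group_vpc(leftover, "vpc-new")
    message = ""
except StaleTargetGroupError as err:
    message = str(err)
print(message)
```

A check like this would run before targets are registered, so the log would name the leftover target group directly rather than looping on InvalidTarget.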

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 15 Jul 2019

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot on 14 Aug 2019

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

fejta-bot on 13 Sep 2019

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot on 13 Sep 2019

Reopening this as we need another solution for those restoring from etcd backup....the controller needs to re-verify the metadata somehow....

Morriz on 22 Oct 2019

@Morriz: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

k8s-ci-robot on 22 Oct 2019

Can somebody reopen this?

Morriz on 22 Oct 2019

@Morriz Is your case caused by https://github.com/kubernetes-sigs/aws-alb-ingress-controller/issues/889#issuecomment-483417207?

M00nF1sh on 22 Oct 2019

Agree also to reopen this. Any time I deploy an ingress for the first time, I have to delete it and re-deploy it in order to make it work.

deimosfr on 26 Jan 2020

👍5

I'm still having this problem with 1.1.5 and terraform 0.12.23.

What's the solution from @stevenoctopus's suggestion?

How can we have the alb-ingress-controller check for the configured VPC ID?
How can we automatically clean up stale target groups?

Maybe implementing something to cleanup target groups when deleting your cluster/vpc/ingress or having alb-ingress-controller check your configured VPC ID before using a bad target group?

agconti on 12 Mar 2020
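
For the cleanup question above, one possible approach outside the controller is to list target groups whose VPC no longer exists and delete them manually. The sketch below covers only the filtering step and makes assumptions: `find_stale_target_groups` is a hypothetical helper, the target group dicts are assumed to carry the `TargetGroupArn` and `VpcId` fields returned by the ELBv2 DescribeTargetGroups API, and the list of live VPC IDs is assumed to come from EC2 DescribeVpcs. The sample data is made up.

```python
from typing import Iterable, List


def find_stale_target_groups(target_groups: Iterable[dict], live_vpc_ids: Iterable[str]) -> List[str]:
    """Return ARNs of target groups whose VPC is not among the live VPCs."""
    live = set(live_vpc_ids)
    return [tg["TargetGroupArn"] for tg in target_groups if tg.get("VpcId") not in live]


# Illustrative data: one target group left over from a deleted VPC.
sample_target_groups = [
    {"TargetGroupArn": "arn:example:targetgroup/current", "VpcId": "vpc-new"},
    {"TargetGroupArn": "arn:example:targetgroup/leftover", "VpcId": "vpc-old"},
]
stale = find_stale_target_groups(sample_target_groups, ["vpc-new"])
print(stale)
```

Review the resulting ARNs before deleting anything, since a target group in a VPC outside the queried region or account would also show up as stale in this sketch.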

Yup, this is still an issue, seeing this exact behavior after tearing down a VPC & EKS cluster to upgrade to EKS 1.15.

log0ymxm on 16 Jun 2020

👍8

/reopen

knight42 on 10 Aug 2020

@knight42: Reopened this issue.

In response to this:

/reopen

k8s-ci-robot on 10 Aug 2020

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

fejta-bot on 9 Sep 2020

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot on 9 Sep 2020

@Morriz Is your case caused by #889 (comment)?

it most probably is, but there is no fix coming then?

Morriz on 9 Sep 2020

/reopen

Would be great to have an option to update target groups with the autodiscovered VPC ID if it's out of date. Unfortunately deleting a VPC doesn't require removing target groups, so if a cluster gets torn down improperly at any time it causes an issue going forward without manual intervention.

maracle6 on 26 Oct 2020

@maracle6: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Would be great to have an option to update target groups with the autodiscovered VPC ID if it's out of date. Unfortunately deleting a VPC doesn't require removing target groups, so if a cluster gets torn down improperly at any time it causes an issue going forward without manual intervention.

k8s-ci-robot on 26 Oct 2020
