Origin: Deleting node (no longer present in cloud provider)

Created on 24 Aug 2017 · 5 comments · Source: openshift/origin

I realize that I'm reporting against the enterprise version and not specifically against Origin. The enterprise version we are using is for trial and testing purposes until we can get this to work. In a nutshell, we are observing that the controller is deleting our nodes that live in a different AWS account because it cannot find them. We can work around the issue by manually adding a HostSubnet; the trick is that if we add it enough times and restart the node service, it will eventually stop rejecting the node and work.
Looking at the Kubernetes code, this appears to be expected behavior, and maybe federation is necessary. But I don't see that federation support has even been added yet, and it is still alpha. So I'm kind of at a loss and haven't found much information about anyone doing a similar setup. Any help is much appreciated.
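For reference, the manual workaround looks roughly like this. The values below are taken from the logs later in this report; this is a sketch, and the exact manifest (and whether `apiVersion: v1` is right for your release) should be checked against your cluster:

```yaml
# hostsubnet.yaml -- recreate the HostSubnet that the controller deleted.
# host/hostIP/subnet values match the stage node shown in the master logs;
# adjust them for each stage node.
apiVersion: v1
kind: HostSubnet
metadata:
  name: ip-172-29-100-200.ec2.internal
host: ip-172-29-100-200.ec2.internal
hostIP: 172.29.100.200
subnet: 10.128.25.0/24
```

Applied with `oc create -f hostsubnet.yaml`, followed by a restart of the node service on the affected host.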

Version

oc version
oc v3.5.5.31
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

openshift v3.5.5.31
kubernetes v1.5.2+43a9be4

Steps To Reproduce

We have two AWS accounts, one for our test/dev env and one for our stage env. Basically our masters, infra, and primary nodes all live in the test AWS account. Our stage env only has 3 nodes that we managed to join to the test cluster.

Current Result

When we restart our master API, the controller deletes the nodes in the stage account because it cannot find them. I assume it is querying the test account's AWS API and seeing that they do not exist there.
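The deletion logic can be sketched as follows. This is an illustrative, self-contained simplification of what the v1.5 node controller does on its sync loop, not the actual Kubernetes source: it asks the configured cloud provider whether each registered node's instance exists, and deletes the Node object when the provider reports it missing. Because the master only holds credentials for one AWS account, instances in the second account always look "not found":

```go
package main

import "fmt"

// knownInstances simulates what the cloud provider credentials can see:
// only instances in the test account are visible.
var knownInstances = map[string]bool{
	"ip-10-101-22-100.localdomain": true, // test account: visible
	// stage-account instances are absent: the API credentials cannot see them
}

// instanceExists stands in for the cloud provider lookup the controller performs.
func instanceExists(nodeName string) bool {
	return knownInstances[nodeName]
}

func main() {
	registeredNodes := []string{
		"ip-10-101-22-100.localdomain",   // test account
		"ip-172-29-100-200.ec2.internal", // stage account, invisible to the API
	}
	for _, node := range registeredNodes {
		if !instanceExists(node) {
			// mirrors the log line from nodecontroller.go below
			fmt.Printf("Deleting node (no longer present in cloud provider): %s\n", node)
		}
	}
}
```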

Expected Result

Stage nodes do not get deleted

Additional Information

Logs from the master -

Aug 23 16:27:07 ip-10-101-22-100.localdomain atomic-openshift-master-controllers[116154]: I0823 16:27:07.000205  116154 nodecontroller.go:419] NodeController observed a new Node: "ip-172-29-100-200.ec2.internal"
Aug 23 16:27:07 ip-10-101-22-100.localdomain atomic-openshift-master-controllers[116154]: I0823 16:27:07.000243  116154 controller_utils.go:274] Recording Registered Node ip-172-29-100-200.ec2.internal in NodeController event message for node ip-172-29-100-200.ec2.internal
Aug 23 16:27:07 ip-10-101-22-100.localdomain atomic-openshift-master-controllers[116154]: I0823 16:27:07.000768  116154 event.go:217] Event(api.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-172-29-100-200.ec2.internal", UID:"cd96edac-8849-11e7-9c05-12ede27226a6", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node ip-172-29-100-200.ec2.internal event: Registered Node ip-172-29-100-200.ec2.internal in NodeController
Aug 23 16:27:07 ip-10-101-22-100.localdomain atomic-openshift-master-controllers[116154]: I0823 16:27:07.170677  116154 nodecontroller.go:508] Deleting node (no longer present in cloud provider): ip-172-29-100-200.ec2.internal
Aug 23 16:27:07 ip-10-101-22-100.localdomain atomic-openshift-master-controllers[116154]: I0823 16:27:07.170714  116154 controller_utils.go:274] Recording Deleting Node ip-172-29-100-200.ec2.internal because it's not present according to cloud provider event message for node ip-172-29-100-200.ec2.internal
Aug 23 16:27:07 ip-10-101-22-100.localdomain atomic-openshift-master-controllers[116154]: I0823 16:27:07.170794  116154 event.go:217] Event(api.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-172-29-100-200.ec2.internal", UID:"cd96edac-8849-11e7-9c05-12ede27226a6", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'DeletingNode' Node ip-172-29-100-200.ec2.internal event: Deleting Node ip-172-29-100-200.ec2.internal because it's not present according to cloud provider
Aug 23 16:27:07 ip-10-101-22-100.localdomain atomic-openshift-master-controllers[116154]: I0823 16:27:07.195322  116154 subnets.go:111] Deleted HostSubnet ip-172-29-100-200.ec2.internal (host: "ip-172-29-100-200.ec2.internal", ip: "172.29.100.200", subnet: "10.128.25.0/24")

Logs from the node

 81990 kubelet_node_status.go:331] Error updating node status, will retry: no node instance returned for "ip-172-29-100-200.ec2.internal"
 81990 kubelet_node_status.go:331] Error updating node status, will retry: no node instance returned for "ip-172-29-100-200.ec2.internal"
 81990 kubelet_node_status.go:331] Error updating node status, will retry: no node instance returned for "ip-172-29-100-200.ec2.internal"
 81990 kubelet_node_status.go:331] Error updating node status, will retry: no node instance returned for "ip-172-29-100-200.ec2.internal"
 81990 kubelet_node_status.go:331] Error updating node status, will retry: no node instance returned for "ip-172-29-100-200.ec2.internal"
 81990 kubelet_node_status.go:323] Unable to update node status: update node status exceeds retry count
 81990 eviction_manager.go:204] eviction manager: unexpected err: failed GetNode: node 'ip-172-29-100-200.ec2.internal' not found
 81990 kubelet_node_status.go:331] Error updating node status, will retry: no node instance returned for "ip-172-29-100-200.ec2.internal"
component/kubernetes kind/bug lifecycle/rotten priority/P2

All 5 comments

Do you have the KubernetesCluster tag in your ec2 instances? You need something like KubernetesCluster=foo on all your instances.

@andrewklau
These are the tags that I have specified on the instances
KubernetesCluster | TestCluster
Name | OpenShift Node - 172.29.100.200
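One quick way to double-check the tag from the stage account (an illustrative aws-cli invocation; the instance ID is a placeholder):

```shell
# Show the KubernetesCluster tag on a stage instance (i-0abc123def456 is a placeholder).
aws ec2 describe-tags \
  --filters "Name=resource-id,Values=i-0abc123def456" \
            "Name=key,Values=KubernetesCluster"

# Add or correct the tag if it is missing or differs from the test account's value:
aws ec2 create-tags \
  --resources i-0abc123def456 \
  --tags Key=KubernetesCluster,Value=TestCluster
```

Note that a matching tag alone may not be enough here: if the master's AWS credentials can only enumerate instances in the test account, stage instances remain invisible to the cloud provider regardless of tags.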

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close
