I realize that I'm reporting this against the Enterprise version and not specifically against Origin. We are using the Enterprise version on a trial basis for testing until we can get this working. In a nutshell: the controller is deleting our nodes that live in a different AWS account because it cannot find them. We can work around the issue by manually adding a HostSubnet; the trick is that if we add it enough times and restart the node service, the controller eventually stops rejecting the node and things work.
Looking at the Kubernetes code, this appears to be expected behavior, and perhaps federation is necessary, but I don't see that federation is even available yet; it is still alpha. So I'm kind of at a loss and haven't found much information on anyone doing a similar setup. Any help is much appreciated.
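For reference, the manual workaround we apply looks roughly like the manifest below (`oc create -f hostsubnet.yaml`). This is a sketch, not copied from our cluster: the host, IP, and subnet values are taken from the deletion log lines later in this report, and the field names follow the OpenShift 3.x HostSubnet resource.

```yaml
# Sketch of the HostSubnet we re-create by hand after the controller
# deletes it; values match the node shown in the master logs below.
apiVersion: v1
kind: HostSubnet
metadata:
  name: ip-172-29-100-200.ec2.internal
host: ip-172-29-100-200.ec2.internal
hostIP: 172.29.100.200
subnet: 10.128.25.0/24
```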
oc version
oc v3.5.5.31
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO
openshift v3.5.5.31
kubernetes v1.5.2+43a9be4
We have two AWS accounts: one for our test/dev environment and one for our stage environment. Our masters, infra, and primary nodes all live in the test account. Our stage environment has only 3 nodes, which we managed to join to the test cluster.
When we restart the master API, the controller deletes the nodes in the stage account because it cannot find them. I assume it is querying the test account and seeing that they do not exist there.
Expected result: stage nodes do not get deleted.
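The behavior we are seeing can be summed up in a small sketch. This is not the actual Kubernetes node controller code, just a hedged model of the decision the logs below show: the controller asks the configured cloud provider (AWS, using the master's credentials) whether each registered node still exists, and deletes any node it cannot find. Since the master's credentials only see the test account, the stage nodes always come back missing.

```python
# Hedged sketch (NOT the real Kubernetes source) of the node controller's
# cloud-provider existence check as observed in the master logs below.

def nodes_to_delete(registered_nodes, visible_instances):
    """Return the node names the controller would delete.

    registered_nodes  -- node names known to the API server
    visible_instances -- instance names the cloud provider can see, i.e.
                         only instances in the AWS account that the
                         master's credentials belong to
    """
    return [n for n in registered_nodes if n not in visible_instances]

# The master's AWS credentials only see test-account instances, so the
# stage node is slated for deletion even though it is healthy.
test_account = {"ip-10-101-22-100.localdomain"}
registered = ["ip-10-101-22-100.localdomain",
              "ip-172-29-100-200.ec2.internal"]
print(nodes_to_delete(registered, test_account))
# → ['ip-172-29-100-200.ec2.internal']
```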
Logs from the master:
Aug 23 16:27:07 ip-10-101-22-100.localdomain atomic-openshift-master-controllers[116154]: I0823 16:27:07.000205 116154 nodecontroller.go:419] NodeController observed a new Node: "ip-172-29-100-200.ec2.internal"
Aug 23 16:27:07 ip-10-101-22-100.localdomain atomic-openshift-master-controllers[116154]: I0823 16:27:07.000243 116154 controller_utils.go:274] Recording Registered Node ip-172-29-100-200.ec2.internal in NodeController event message for node ip-172-29-100-200.ec2.internal
Aug 23 16:27:07 ip-10-101-22-100.localdomain atomic-openshift-master-controllers[116154]: I0823 16:27:07.000768 116154 event.go:217] Event(api.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-172-29-100-200.ec2.internal", UID:"cd96edac-8849-11e7-9c05-12ede27226a6", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node ip-172-29-100-200.ec2.internal event: Registered Node ip-172-29-100-200.ec2.internal in NodeController
Aug 23 16:27:07 ip-10-101-22-100.localdomain atomic-openshift-master-controllers[116154]: I0823 16:27:07.170677 116154 nodecontroller.go:508] Deleting node (no longer present in cloud provider): ip-172-29-100-200.ec2.internal
Aug 23 16:27:07 ip-10-101-22-100.localdomain atomic-openshift-master-controllers[116154]: I0823 16:27:07.170714 116154 controller_utils.go:274] Recording Deleting Node ip-172-29-100-200.ec2.internal because it's not present according to cloud provider event message for node ip-172-29-100-200.ec2.internal
Aug 23 16:27:07 ip-10-101-22-100.localdomain atomic-openshift-master-controllers[116154]: I0823 16:27:07.170794 116154 event.go:217] Event(api.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-172-29-100-200.ec2.internal", UID:"cd96edac-8849-11e7-9c05-12ede27226a6", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'DeletingNode' Node ip-172-29-100-200.ec2.internal event: Deleting Node ip-172-29-100-200.ec2.internal because it's not present according to cloud provider
Aug 23 16:27:07 ip-10-101-22-100.localdomain atomic-openshift-master-controllers[116154]: I0823 16:27:07.195322 116154 subnets.go:111] Deleted HostSubnet ip-172-29-100-200.ec2.internal (host: "ip-172-29-100-200.ec2.internal", ip: "172.29.100.200", subnet: "10.128.25.0/24")
Logs from the node:
81990 kubelet_node_status.go:331] Error updating node status, will retry: no node instance returned for "ip-172-29-100-200.ec2.internal"
81990 kubelet_node_status.go:331] Error updating node status, will retry: no node instance returned for "ip-172-29-100-200.ec2.internal"
81990 kubelet_node_status.go:331] Error updating node status, will retry: no node instance returned for "ip-172-29-100-200.ec2.internal"
81990 kubelet_node_status.go:331] Error updating node status, will retry: no node instance returned for "ip-172-29-100-200.ec2.internal"
81990 kubelet_node_status.go:331] Error updating node status, will retry: no node instance returned for "ip-172-29-100-200.ec2.internal"
81990 kubelet_node_status.go:323] Unable to update node status: update node status exceeds retry count
81990 eviction_manager.go:204] eviction manager: unexpected err: failed GetNode: node 'ip-172-29-100-200.ec2.internal' not found
81990 kubelet_node_status.go:331] Error updating node status, will retry: no node instance returned for "ip-172-29-100-200.ec2.internal"
Do you have the KubernetesCluster tag on your EC2 instances? You need something like KubernetesCluster=foo on all of your instances.
@andrewklau
These are the tags I have specified on the instances:
KubernetesCluster | TestCluster
Name | OpenShift Node - 172.29.100.200
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close