Cluster-api: machine controller should sense the event that infra obj turns into notReady

Created on 20 Oct 2019  ·  24 Comments  ·  Source: kubernetes-sigs/cluster-api

If the infra object turns from ready to notReady and has not been patched with any ErrorMessage/ErrorReason, we only requeue the machine object and it stays in the Running state. https://github.com/kubernetes-sigs/cluster-api/blob/0da8bded303aa68acc3eb5fe29d49cf4cba7abb9/controllers/machine_controller_phases.go#L232
If the infra machine is recoverable (e.g. after a reboot), I don't think the infra controller should set any ErrorMessage/ErrorReason on the infra resource.

/kind bug

help wanted · kind/bug · lifecycle/rotten · priority/important-soon

All 24 comments

I believe this is handled in a different place. The reconcile flow is:

  1. https://github.com/kubernetes-sigs/cluster-api/blob/d3140bacceea81d85da9341a591bb87e0ed2d008/controllers/machine_controller.go#L163
  2. https://github.com/kubernetes-sigs/cluster-api/blob/d3140bacceea81d85da9341a591bb87e0ed2d008/controllers/machine_controller_phases.go#L209
  3. https://github.com/kubernetes-sigs/cluster-api/blob/d3140bacceea81d85da9341a591bb87e0ed2d008/controllers/machine_controller_phases.go#L135-L148

At this point, the machine's .status.errorReason and/or .status.errorMessage have been populated if the infra resource was in error. This happens regardless of the value of the infra resource's .status.ready field. The line you linked above happens later.

However, as best I can tell, once the machine's .status.errorReason and/or .status.errorMessage are set, the code does not provide any way to clear them. But I believe this is the correct behavior. If a machine is in error, this is a permanent failure condition.
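The one-way propagation described above can be sketched as follows (hypothetical simplified types, not the real cluster-api structs): the infra resource's error fields are copied onto the machine regardless of `.status.ready`, and no code path ever resets them.

```go
package main

import "fmt"

// Simplified stand-ins for the objects involved (hypothetical, not the
// actual cluster-api types).
type InfraStatus struct {
	Ready        bool
	ErrorReason  *string
	ErrorMessage *string
}

type MachineStatus struct {
	ErrorReason  *string
	ErrorMessage *string
}

// reconcileExternal sketches step 3 of the flow above: error fields
// are copied to the machine independently of .status.ready, and once
// set they are never cleared.
func reconcileExternal(m *MachineStatus, infra InfraStatus) {
	if infra.ErrorReason != nil {
		m.ErrorReason = infra.ErrorReason
	}
	if infra.ErrorMessage != nil {
		m.ErrorMessage = infra.ErrorMessage
	}
	// Note: there is no branch that resets m.ErrorReason or
	// m.ErrorMessage to nil, which is what makes the failure permanent.
}

func main() {
	reason := "InstanceTerminated"
	m := MachineStatus{}
	reconcileExternal(&m, InfraStatus{Ready: true, ErrorReason: &reason})
	fmt.Println(*m.ErrorReason)
	// A later reconcile with a clean infra status does not clear it.
	reconcileExternal(&m, InfraStatus{Ready: false})
	fmt.Println(*m.ErrorReason)
}
```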

Sorry, I might not have expressed myself clearly. If the infra machine enters a permanent failure condition, the controller on the infra side should set ErrorMessage/ErrorReason; I agree with that. But if the infra machine is recoverable (e.g. after a reboot), the controller on the infra side (in my implementation) would just flip the infra machine from ready to not ready, and the machine controller in cluster-api cannot sense it and stays in the Running state.

I discussed this situation with @detiber on Slack, and he agreed with me too.

Ok, what you're pointing out makes sense. My apologies for being slightly confused 😄.

/help

@ncdc:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

For context, in v1alpha2 we chose not to recover from a failure case to avoid loops or going back and forth between the states outlined in the machine proposal. This behavior aligns with the assumption that Machines are immutable: if you get a failure, the action the system expects an operator to take is to delete and recreate the Machine.

@vincepri There are currently plenty of recoverable errors that we encounter today, we just don't necessarily bubble them up in any way. For example if the AWS API is unavailable for any reason.

I don't think we should say that any error should be unrecoverable, nor should we say that we should resolve errors in the controllers. I think the case that is more grey (and potentially where this issue comes into play) is how to bubble up recoverable errors (that are resolved on the cloud provider side) to users without turning them into non-recoverable errors (ErrorMessage/ErrorReason).

In the specific case of Reboot, if an instance is triggered into a 'Reboot' state by the cloud provider somehow and we don't need to take remedial action to resolve it, then why shouldn't we allow it to recover gracefully?

Yes, we need to distinguish (somehow) between controller-retryable errors and failures.

In the specific case of Reboot, if an instance is triggered into a 'Reboot' state by the cloud provider somehow and we don't need to take remedial action to resolve it, then why shouldn't we allow it to recover gracefully?

It'll be up to the infrastructure provider to make sure a machine that's being rebooted isn't marked as failed. ~From CAPI perspective, a Machine that's being rebooted, it'll cause the Node to become not ready and fall from Running state to Provisioned.~

Nevermind, https://github.com/kubernetes-sigs/cluster-api/blob/master/controllers/machine_controller_phases.go#L60-L63 the code actually doesn't check for node readiness.

@vincepri to look up what we wrote in the v1a2 proposal around ready->notReady transitions.

One thing to note, I don't think we necessarily need to reset the Machine Phase back to a previous state, we could potentially introduce a new Phase that could be used to indicate formerly ready but not currently ready.

look up what we wrote in the v1a2 proposal around ready->notReady transitions.

If we allow flipping from ready -> notReady, the phase would go back to Provisioning as outlined in the logic here https://github.com/kubernetes-sigs/cluster-api/blob/d3140bacceea81d85da9341a591bb87e0ed2d008/controllers/machine_controller_phases.go#L51

I'm not super opposed to this behavior, but not in total favor either. I'd rather have the Machine controller detect that the linked node is not in a ready state or not by checking the NodeRef and flip the status to Provisioned. Which is in line with the state conditions outlined in the original proposal https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20190610-machine-states-preboot-bootstrapping.md#transition-conditions-3.
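The phase decision described here could be sketched like this (a simplified hypothetical model, not the actual cluster-api logic): check the NodeRef and node readiness, and flip Running back to Provisioned when the node is no longer ready.

```go
package main

import "fmt"

// Machine is a stripped-down stand-in for the fields the phase
// decision would look at (hypothetical).
type Machine struct {
	HasNodeRef bool
	InfraReady bool
	NodeReady  bool
}

// phase flips a previously Running machine back to Provisioned when
// the linked node is missing or not ready, per the proposal's
// transition conditions.
func phase(m Machine) string {
	switch {
	case !m.InfraReady:
		return "Provisioning"
	case !m.HasNodeRef || !m.NodeReady:
		return "Provisioned"
	default:
		return "Running"
	}
}

func main() {
	fmt.Println(phase(Machine{HasNodeRef: true, InfraReady: true, NodeReady: true}))
	fmt.Println(phase(Machine{HasNodeRef: true, InfraReady: true, NodeReady: false}))
}
```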

One thing to note, I don't think we necessarily need to reset the Machine Phase back to a previous state, we could potentially introduce a new Phase that could be used to indicate formerly ready but not currently ready.

I'd like to avoid, at least for now, keeping track of previous states; introducing something like Conditions to understand how we arrived at a state would need to be a separate CAEP, wdyt?

👍 for checking the node ref and flipping it back to provisioned for the time being, and doing a separate proposal around conditions (hint hint @timothysc 😄)

/assign

I'll work on this issue if nobody has any concerns.

I'm wholly against reversing machine states. The flow of the machine-controller and associated infra controller should be one-way.

If some machine that was previously ready (eg, became a node) has now failed, we'll see that indication on the node, and we remediate:
1) Delete the machine object
2) Machine controller cordons and drains the node
3) Backing instance is deleted from the cloud
4) MachineSet or other thing creates a replacement machine

Immutable infrastructure.

For me this is less about reversing states as much as giving users visibility into the underlying resources we are managing.

From a user's point of view (not taking automated remediation into account for the moment): if I'm looking at Cluster API and it is telling me that my Machine is "Ready" (as a proxy for both the Node being present and the underlying infrastructure being in a running state), but the underlying infra is not in a running state, I no longer have a sense of what the problem is or how to start remediating it.

I don't think we should be thinking of this from the perspective of reversing the state machine as much as providing visibility into the actual state of the world wrt the resources we are managing for the user.

From a troubleshooting perspective, I'd like to ideally get signal from the Cluster, which gives me enough information to follow the breadcrumbs until I get to the actual problematic resource.

underlying resources we are managing.

A node is an overlying resource relative to a machine. That's where a user's focus should be, the health of nodes. If we're talking about disposable, immutable machines, if the node goes unready/unhealthy, you get rid of it. You don't have to care what the layer underneath is doing, that's the entire point.

If I'm looking at Cluster API and it is telling me that my Machine is "Ready"

Yeah, "Ready" doesn't make sense for a machine. Probably the terminal state should be 'Complete' after it becomes a node.

I'd like to ideally get signal from the Cluster

Then we should sync some aggregate health information from the nodes to the cluster object. Machines are just an implementation detail. There's also room for syncing some info about machines that haven't received a nodeRef after X time.

Then we should sync some aggregate health information from the nodes to the cluster object. Machines are just an implementation detail. There's also room for syncing some info about machines that haven't received a nodeRef after X time.

I think it's more than just the nodes though, if we expect someone to take manual remediation efforts, we need to point them to the resource that is causing the trouble, otherwise we are requiring them to know all the implementation details of how cluster-api and providers work before they could even start to troubleshoot and fix the issue.

Remediation always needs to start at the node, automated or otherwise. If you have an unhealthy node, you delete the backing machine-object. If the resolution is any more complicated than that, they'll need to know those provider details either way.

We could build logic into the machine-controller to delete an associated machine if the node object is deleted, but there's other tradeoffs with that (eg, users might not drain the node first, etc).

Let's take Kubernetes out of the equation: say you have an application deployed across 3 instances in the cloud, and they're deployed behind a load balancer. The load balancer has health checks; it detects whether or not the application is up, not the backing instance. You could build automation to watch which endpoints are healthy in the load balancer, and replace those instances when they fail.

Back to Kubernetes: in our case, the application we care about is the node. Our application can go down for a variety of reasons, but that's our primary signal. If the instance dies, the application dies with it. If the application can't start due to corruption, well, the instance might be alive and well as far as the cloud provider is concerned. It doesn't make any sense to label the cloud VM itself with the node's status, thus it doesn't make any sense to label the machine object with the node's status.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

I think this has been fixed on master / v1alpha3. We reconcile the ready value whenever it changes; can this be closed?

/cc @ncdc

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Closing, fixed in v1alpha3
/close

@vincepri: Closing this issue.

In response to this:

Closing, fixed in v1alpha3
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
