Cluster-api: States, transitions, and conditions

Created on 25 Oct 2019  路  10Comments  路  Source: kubernetes-sigs/cluster-api

Opening this issue as a placeholder, to be filled as time goes by. We know that the current model has gaps in terms of Status.Phase field, its use and shortcomings. We'll be tracking progress towards a proposal and better model within this issue. Feel free to add more information, comments, concerns.

/kind proposal
/milestone Next
/priority important-longterm
/assign @timothysc

areapi kinproposal lifecyclactive prioritimportant-longterm

Most helpful comment

Some drive-by comments (feel free to ping me at evana at vmware.com if I don't notice updates here):

We spent a long time (about 4 weeks) working on Condition design with UX designers and tooling engineers. The end result is specified here, but I'll also give some useful summaries and a link to the library we wrote:

https://knative.dev/docs/serving/spec/knative-api-specification-1.0/#error-signalling

Some of the lessons we learned:

  • Having a consistent polarity for Conditions (i.e. "True = good" or "False = good") is really helpful for tooling, particularly if you want the server side to be able to add or change conditions without needing to go through and coordinate with every single tool that works with the API.

    • It's important to name the conditions so that it's clear what "True" and "False" means. Sometimes this means you need to be creative, i.e. "ResourcesAvailable" even if you're really thinking of it as "QuotaExhausted" (and definitely don't call the condition "Quota" without indicating what True and False mean).

  • It's really useful to have an agreed-on top-level condition that summarizes all the other conditions (again, it simplifies work for tool authors). In Knative, we defined two top-level conditions. Each resource should have one or the other of these conditions (and not both):

    • Ready, for resources which represent an ongoing status, like a Deployment

    • Succeeded, for resources which run-to-completion, like a Job

  • Every resource except data-only objects which aren't reconciled (like ConfigMap) must have conditions. This avoids situations like NetworkPolicy, where you can't tell if the cluster is actually able to support the NetworkPolicy based on the CNI plugin used.
  • Our experience with actual users suggested that the "True = good" polarity was read more naturally than "False = good". In particular, "FooFailed = False" tended to be read as a problem or a negative situation, rather than a successful state. This is particularly an issue for users in stressful situations (new users, live demos, handling production outages, etc).
  • Knative also added a "severity" field to conditions to handle providing warnings. Note that the default severity was "Error", which we ended up spelling as "" (empty string). See the previous point for the rationale. (Multiple senior PMs and engineers were thrown off by:
    yaml conditions: <ul> <li>type=Ready<br /> severity=Error<br /> status=True<br />

    Severity may be overkill for your needs; we found some cases where it was helpful to be able to pass back information in conditions that didn't indicate an error.

Library implementing consistent ("True = good") polarity, condition aggregation to common top-level: https://godoc.org/github.com/knative/pkg/apis#ConditionSet.Manage

All 10 comments

Some controversy over lists of named subobjects vs use of maps

Just getting into CAPI internals so could be off - but - as an interim solution with long term dashboard value --- Maybe for now just putting the right metrics for API calls in place will be a good enough workaround ---

  • labels for the providers based on infrastructureRef will give you provider specific states
  • labels such as still_awaiting_cloud_controller
  • update docs to say 'if your cluster isnt converging - curl localhost:9090 | grep cloud_controller and look at the numbers'

If this issue is adequetely resourced, then the status/phase implementation is also great... just wanted to make sure i offered up a lightweight workaround in case this might take a while.

Some drive-by comments (feel free to ping me at evana at vmware.com if I don't notice updates here):

We spent a long time (about 4 weeks) working on Condition design with UX designers and tooling engineers. The end result is specified here, but I'll also give some useful summaries and a link to the library we wrote:

https://knative.dev/docs/serving/spec/knative-api-specification-1.0/#error-signalling

Some of the lessons we learned:

  • Having a consistent polarity for Conditions (i.e. "True = good" or "False = good") is really helpful for tooling, particularly if you want the server side to be able to add or change conditions without needing to go through and coordinate with every single tool that works with the API.

    • It's important to name the conditions so that it's clear what "True" and "False" means. Sometimes this means you need to be creative, i.e. "ResourcesAvailable" even if you're really thinking of it as "QuotaExhausted" (and definitely don't call the condition "Quota" without indicating what True and False mean).

  • It's really useful to have an agreed-on top-level condition that summarizes all the other conditions (again, it simplifies work for tool authors). In Knative, we defined two top-level conditions. Each resource should have one or the other of these conditions (and not both):

    • Ready, for resources which represent an ongoing status, like a Deployment

    • Succeeded, for resources which run-to-completion, like a Job

  • Every resource except data-only objects which aren't reconciled (like ConfigMap) must have conditions. This avoids situations like NetworkPolicy, where you can't tell if the cluster is actually able to support the NetworkPolicy based on the CNI plugin used.
  • Our experience with actual users suggested that the "True = good" polarity was read more naturally than "False = good". In particular, "FooFailed = False" tended to be read as a problem or a negative situation, rather than a successful state. This is particularly an issue for users in stressful situations (new users, live demos, handling production outages, etc).
  • Knative also added a "severity" field to conditions to handle providing warnings. Note that the default severity was "Error", which we ended up spelling as "" (empty string). See the previous point for the rationale. (Multiple senior PMs and engineers were thrown off by:
    yaml conditions: <ul> <li>type=Ready<br /> severity=Error<br /> status=True<br />

    Severity may be overkill for your needs; we found some cases where it was helpful to be able to pass back information in conditions that didn't indicate an error.

Library implementing consistent ("True = good") polarity, condition aggregation to common top-level: https://godoc.org/github.com/knative/pkg/apis#ConditionSet.Manage

Thank you so much @evankanderson for sharing your experience with conditions! After v1alpha3 is wrapped up, we'll start working on a proposal, the context you provided is going to be super helpful.

/milestone v0.3.x

/assign @fabriziopandini
/lifecycle active
/area api

As per April 27th 2020 community guidelines, this project follows the process outlined in https://github.com/kubernetes-sigs/cluster-api/blob/master/CONTRIBUTING.md#proposal-process-caep for large features or changes.

Following those guidelines, I'll go ahead and close this issue for now and defer to contributors interested in pushing the proposal forward to open a collaborative document proposal instead, and follow the process as described.

/close

@vincepri: Closing this issue.

In response to this:

As per April 27th 2020 community guidelines, this project follows the process outlined in https://github.com/kubernetes-sigs/cluster-api/blob/master/CONTRIBUTING.md#proposal-process-caep for large features or changes.

Following those guidelines, I'll go ahead and close this issue for now and defer to contributors interested in pushing the proposal forward to open a collaborative document proposal instead, and follow the process as described.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vincepri @ncdc This is linked in the roadmap under "Improved status conditions" https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/book/src/roadmap.md

Based on https://github.com/kubernetes-sigs/cluster-api/issues/1658#issuecomment-620118757
is there a doc related to this? if so is there a place where I can find the link to that doc? I'd expect this ticket to be that place so the roadmap does not lead us to a dead end.

My intention was to add a relevant ref https://github.com/kubernetes/enhancements/pull/1624 to this topic.

new tracking issue for the CAEP is https://github.com/kubernetes-sigs/cluster-api/issues/3005
@enxebre the Kuberenetes KEP is already in the radar

Was this page helpful?
0 / 5 - 0 ratings