Cloud-on-k8s: Enforce Elastic stack upgrade order

Created on 24 Feb 2020  路  3Comments  路  Source: elastic/cloud-on-k8s

Supersedes #2426 and #2353 (see there for more context and discussion)

Currently each stack application is reconciled independently of each other. When a user upgrades the version of a group of linked resources (Kibana, ES, APMServer) this is violating our own documented stack upgrade procedure which clearly defines an upgrade order. The problems described in #2426 and #2353 would be avoided if the upgrade order would be enforced by ECK:

  1. upgrade Elasticsearch
  2. upgrade Kibana
  3. upgrade APMServer

For that the Kibana and APM controllers would have to

  • inspect the linked Elasticsearch resource, and compare the specified version with their own versions
  • find a way of determining whether or not the version/generation specified is also the version/generation currently rolled out
  • delay their own rollout of the version upgrade until ES and all predecessors in the upgrade graph have completed their upgrade

Problems/Questions:

  • how do we express the upgrade order? APMServer does not hold a reference to the related Kibana instance. How does it know the state of any related Kibana node?
  • how do we know that a given spec is rolled out without looking at individual pods. Can we rely on an extended status subresource that somehow expresses the reconciliation state?
>enhancement discuss

Most helpful comment

One thing that was mentioned on https://github.com/elastic/cloud-on-k8s/issues/2353#issuecomment-574358217 is to reject (as opposed to delay) the version upgrade at the validation webhook level. I think we should not go with that approach since it makes the user experience a bit painful. As a user I would expect I can update both Elasticsearch and Kibana version at the same time in the YAML manifest, then let ECK deal with it.


A while ago we tried to design the association controllers in such a way that they may run with different managed namespaces contexts. For example the Kibana controller may not have access to the Elasticsearch namespace, so it should not deal with Elasticsearch resources at all. I'm not sure where we stand now regarding this "constraint", but I think we should aim at keeping the various controllers responsibilities decoupled so that, for example:

  • the Kibana-ES association controller does not have to deal with Elasticsearch Pods
  • the APM controller does not have to inspect a Kibana resource (nor Kibana Pods)

Here is what I have in mind, I think it's pretty close to ideas @pebrc expressed in the first post:

Currently running version reflected in the status

Add an additional version field in the status all CRDs, set by each resource controller. It represents the lowest currently running version. For example, it would still indicate 7.7.0 if at least one ES node is running 7.7.0, even though other nodes are already running 7.8.0. This can be done by simply looking at the current StatefulSet|Deployment and Pods at the end of the Elasticsearch (Kibana, etc.) reconciliation. I think it can be useful beyond the scope of this PR, just to know what's the state of the current version upgrade.

{
  "availableNodes": 1,
  "version": "7.7.0",
  "health": "green",
  "phase": "ApplyingChanges"
}

Annotation set by the association controller in associated resources

In each association controller (eg. the Kibana-Elasticsearch association controller), annotate each resource where the association is specified (eg. Kibana) with the version of the resource it is associated to (eg. Elasticsearch), retrieved from its status. For example, if an Elasticsearch resource status reports version: 7.7.0, the association controller sets the following annotation in the Kibana resource:

kibanaassociation.k8s.elastic.co/elasticsearch-version: 7.7.0

This would be done generically by the association controller for any association:

The APM-Kibana association controller would set the following annotation on the APM resource:

apmassociation.k8s.elastic.co/kibana-version: 7.7.0

The Beat-ES association controller would set the following annotation on the Beat resource:

beatassociation.k8s.elastic.co/elasticsearch-version: 7.7.0

The Beat-Kibana association controller would set the following annotation on the beat resource:

beatassociation.k8s.elastic.co/kibana-version: 7.7.0

The association controller also ensures the annotation is not set if there is no association specified.

Delaying resources version upgrade

As part of each resource reconciliation (eg. before updating a Deployment in the Kibana controller), inspect the annotations that make sense (eg. kibanaassociation.k8s.elastic.co/elasticsearch-version):

  • if not set, and no Elasticsearch is referenced in the Kibana resource, the controller can move on with any version upgrade
  • if not set, but an Elasticsearch is referenced in the Kibana resource, the association controller will likely set an annotation soon: requeue
  • if version expressed in the annotation is lower than the expected resource version, do not perform any action that would lead to creating an instance of the resource in the newer version. Update the resource status to phase: DelayingChanges (as opposed to: ApplyingChanges).
{
  "availableNodes": 1,
  "version": "7.7.0",
  "health": "green",
  "phase": "DelayingChanges"
}
  • if version expressed in the annotation is higher or equal to the expected resource version: move on with any change

This mechanism does not enforce a global stack upgrade order, but rather an implicit dependency graph between associated resources:

  • Kibana can only be upgraded after Elasticsearch
  • APM can only be upgraded after Elasticsearch, and after Kibana
  • Beats can only be upgraded after Elasticsearch, and after Kibana
  • Enterprise Search can only be upgraded after Elasticsearch

All 3 comments

One thing that was mentioned on https://github.com/elastic/cloud-on-k8s/issues/2353#issuecomment-574358217 is to reject (as opposed to delay) the version upgrade at the validation webhook level. I think we should not go with that approach since it makes the user experience a bit painful. As a user I would expect I can update both Elasticsearch and Kibana version at the same time in the YAML manifest, then let ECK deal with it.


A while ago we tried to design the association controllers in such a way that they may run with different managed namespaces contexts. For example the Kibana controller may not have access to the Elasticsearch namespace, so it should not deal with Elasticsearch resources at all. I'm not sure where we stand now regarding this "constraint", but I think we should aim at keeping the various controllers responsibilities decoupled so that, for example:

  • the Kibana-ES association controller does not have to deal with Elasticsearch Pods
  • the APM controller does not have to inspect a Kibana resource (nor Kibana Pods)

Here is what I have in mind, I think it's pretty close to ideas @pebrc expressed in the first post:

Currently running version reflected in the status

Add an additional version field in the status all CRDs, set by each resource controller. It represents the lowest currently running version. For example, it would still indicate 7.7.0 if at least one ES node is running 7.7.0, even though other nodes are already running 7.8.0. This can be done by simply looking at the current StatefulSet|Deployment and Pods at the end of the Elasticsearch (Kibana, etc.) reconciliation. I think it can be useful beyond the scope of this PR, just to know what's the state of the current version upgrade.

{
  "availableNodes": 1,
  "version": "7.7.0",
  "health": "green",
  "phase": "ApplyingChanges"
}

Annotation set by the association controller in associated resources

In each association controller (eg. the Kibana-Elasticsearch association controller), annotate each resource where the association is specified (eg. Kibana) with the version of the resource it is associated to (eg. Elasticsearch), retrieved from its status. For example, if an Elasticsearch resource status reports version: 7.7.0, the association controller sets the following annotation in the Kibana resource:

kibanaassociation.k8s.elastic.co/elasticsearch-version: 7.7.0

This would be done generically by the association controller for any association:

The APM-Kibana association controller would set the following annotation on the APM resource:

apmassociation.k8s.elastic.co/kibana-version: 7.7.0

The Beat-ES association controller would set the following annotation on the Beat resource:

beatassociation.k8s.elastic.co/elasticsearch-version: 7.7.0

The Beat-Kibana association controller would set the following annotation on the beat resource:

beatassociation.k8s.elastic.co/kibana-version: 7.7.0

The association controller also ensures the annotation is not set if there is no association specified.

Delaying resources version upgrade

As part of each resource reconciliation (eg. before updating a Deployment in the Kibana controller), inspect the annotations that make sense (eg. kibanaassociation.k8s.elastic.co/elasticsearch-version):

  • if not set, and no Elasticsearch is referenced in the Kibana resource, the controller can move on with any version upgrade
  • if not set, but an Elasticsearch is referenced in the Kibana resource, the association controller will likely set an annotation soon: requeue
  • if version expressed in the annotation is lower than the expected resource version, do not perform any action that would lead to creating an instance of the resource in the newer version. Update the resource status to phase: DelayingChanges (as opposed to: ApplyingChanges).
{
  "availableNodes": 1,
  "version": "7.7.0",
  "health": "green",
  "phase": "DelayingChanges"
}
  • if version expressed in the annotation is higher or equal to the expected resource version: move on with any change

This mechanism does not enforce a global stack upgrade order, but rather an implicit dependency graph between associated resources:

  • Kibana can only be upgraded after Elasticsearch
  • APM can only be upgraded after Elasticsearch, and after Kibana
  • Beats can only be upgraded after Elasticsearch, and after Kibana
  • Enterprise Search can only be upgraded after Elasticsearch

+1 to delaying rather then blocking. But, if delay is going to be significantly harder to implement then just blocking for now, maybe implement blocking first, then remove the blocking once delaying is ready? I didn't expect the breakage I saw when I updated both at once and kibana upgraded significantly faster then elastticsearch, causing an issue.

Differences between the design proposal above and the actual implementation:

  • we reuse the existing associationConf annotations instead of introducing a new set of annotations
  • phase: DelayingChanges reported in the status could be implemented in a follow-up
Was this page helpful?
0 / 5 - 0 ratings

Related issues

sebgl picture sebgl  路  3Comments

djschny picture djschny  路  4Comments

spencergilbert picture spencergilbert  路  3Comments

deepaksinghcs14 picture deepaksinghcs14  路  4Comments

nkvoll picture nkvoll  路  6Comments