Supersedes #2426 and #2353 (see there for more context and discussion)
Currently, each stack application is reconciled independently of the others. When a user upgrades the version of a group of linked resources (Kibana, ES, APM Server), this violates our own documented stack upgrade procedure, which clearly defines an upgrade order. The problems described in #2426 and #2353 would be avoided if ECK enforced the upgrade order:
For that the Kibana and APM controllers would have to
Problems/Questions:
One thing that was mentioned on https://github.com/elastic/cloud-on-k8s/issues/2353#issuecomment-574358217 is to reject (as opposed to delay) the version upgrade at the validation webhook level. I think we should not go with that approach since it makes the user experience a bit painful. As a user, I would expect to be able to update both the Elasticsearch and Kibana versions at the same time in the YAML manifest, then let ECK deal with it.
A while ago we tried to design the association controllers in such a way that they may run with different managed-namespace contexts. For example, the Kibana controller may not have access to the Elasticsearch namespace, so it should not deal with Elasticsearch resources at all. I'm not sure where we stand now regarding this "constraint", but I think we should aim at keeping the various controllers' responsibilities decoupled so that, for example:
Here is what I have in mind; I think it's pretty close to the ideas @pebrc expressed in the first post:
Currently running version reflected in the status
Add an additional version field to the status of all CRDs, set by each resource controller. It represents the lowest currently running version. For example, it would still indicate 7.7.0 if at least one ES node is running 7.7.0, even though other nodes are already running 7.8.0. This can be done by simply looking at the current StatefulSet|Deployment and Pods at the end of the Elasticsearch (Kibana, etc.) reconciliation. I think it can be useful beyond the scope of this PR, just to know the state of the current version upgrade.
{
"availableNodes": 1,
"version": "7.7.0",
"health": "green",
"phase": "ApplyingChanges"
}
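To make the "lowest currently running version" rule concrete, here is a minimal sketch. It is written in Python purely for illustration (ECK itself is written in Go), and the helper names `parse_version` and `lowest_running_version` are hypothetical, not part of ECK. It assumes plain `major.minor.patch` version strings already collected from the Pods.

```python
from typing import Iterable, Optional, Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse a 'major.minor.patch' string into a tuple that compares numerically."""
    major, minor, patch = version.split(".")
    return int(major), int(minor), int(patch)


def lowest_running_version(pod_versions: Iterable[str]) -> Optional[str]:
    """Return the lowest version among the running Pods, or None if there are none."""
    versions = list(pod_versions)
    if not versions:
        return None
    # Compare parsed tuples, not strings, so that 7.10.0 sorts above 7.9.3.
    return min(versions, key=parse_version)


# Mid-upgrade cluster: one node is still on 7.7.0, two are already on 7.8.0.
print(lowest_running_version(["7.8.0", "7.7.0", "7.8.0"]))
```

With this rule the status keeps reporting 7.7.0 until the last node has been upgraded, which is what the dependent resources need to know.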
Annotation set by the association controller in associated resources
In each association controller (eg. the Kibana-Elasticsearch association controller), annotate each resource where the association is specified (eg. Kibana) with the version of the resource it is associated to (eg. Elasticsearch), retrieved from its status. For example, if an Elasticsearch resource status reports version: 7.7.0, the association controller sets the following annotation in the Kibana resource:
kibanaassociation.k8s.elastic.co/elasticsearch-version: 7.7.0
This would be done generically by the association controller for any association:
The APM-Kibana association controller would set the following annotation on the APM resource:
apmassociation.k8s.elastic.co/kibana-version: 7.7.0
The Beat-ES association controller would set the following annotation on the Beat resource:
beatassociation.k8s.elastic.co/elasticsearch-version: 7.7.0
The Beat-Kibana association controller would set the following annotation on the Beat resource:
beatassociation.k8s.elastic.co/kibana-version: 7.7.0
The association controller also ensures the annotation is not set if there is no association specified.
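The generic set-or-remove behaviour described above could look roughly like the following sketch. It is Python for illustration only (ECK is written in Go). The function names are hypothetical, and the key pattern is inferred from the four example annotations in this comment. Passing `None` as the target version models "no association specified".

```python
from typing import Dict, Optional


def annotation_key(resource_kind: str, target_kind: str) -> str:
    """Build the annotation key, e.g. kibanaassociation.k8s.elastic.co/elasticsearch-version."""
    return f"{resource_kind}association.k8s.elastic.co/{target_kind}-version"


def reconcile_version_annotation(
    annotations: Dict[str, str],
    resource_kind: str,
    target_kind: str,
    target_status_version: Optional[str],
) -> Dict[str, str]:
    """Return a copy of the annotations with the version annotation set or removed."""
    updated = dict(annotations)
    key = annotation_key(resource_kind, target_kind)
    if target_status_version is None:
        # No association specified: make sure the annotation is not set.
        updated.pop(key, None)
    else:
        # Mirror the version reported in the associated resource's status.
        updated[key] = target_status_version
    return updated


kibana_annotations = reconcile_version_annotation({}, "kibana", "elasticsearch", "7.7.0")
print(kibana_annotations)
```

Returning a new dictionary instead of mutating the input keeps the sketch easy to test. A real controller would only issue an update when the result differs from the current annotations.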
Delaying resources version upgrade
As part of each resource reconciliation (eg. before updating a Deployment in the Kibana controller), inspect the annotations that make sense (eg. kibanaassociation.k8s.elastic.co/elasticsearch-version):
phase: DelayingChanges (as opposed to: ApplyingChanges).
{
"availableNodes": 1,
"version": "7.7.0",
"health": "green",
"phase": "DelayingChanges"
}
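As a rough sketch of the delay decision: the exact rule is not spelled out in this comment, so the code below assumes that an upgrade is delayed while any associated resource reports a running version lower than the version being rolled out. It is Python for illustration only (ECK is written in Go), and `reconcile_phase` and its parameters are hypothetical names.

```python
from typing import Dict, Iterable, Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse a 'major.minor.patch' string into a tuple that compares numerically."""
    major, minor, patch = version.split(".")
    return int(major), int(minor), int(patch)


def reconcile_phase(
    desired_version: str,
    annotations: Dict[str, str],
    dependency_keys: Iterable[str],
) -> str:
    """Return 'DelayingChanges' if a dependency still runs an older version."""
    for key in dependency_keys:
        running = annotations.get(key)
        # Assumed rule: a missing annotation means no association, so nothing to wait for.
        if running is not None and parse_version(running) < parse_version(desired_version):
            return "DelayingChanges"
    return "ApplyingChanges"


es_key = "kibanaassociation.k8s.elastic.co/elasticsearch-version"
# Kibana wants 7.8.0 while Elasticsearch still reports 7.7.0: the upgrade waits.
print(reconcile_phase("7.8.0", {es_key: "7.7.0"}, [es_key]))
```

Because each controller only reads annotations on its own resource, it never needs access to the namespace of the associated resource.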
This mechanism does not enforce a global stack upgrade order, but rather an implicit dependency graph between associated resources:
+1 to delaying rather than blocking. But if delaying is going to be significantly harder to implement than just blocking for now, maybe implement blocking first, then remove the blocking once delaying is ready? I didn't expect the breakage I saw when I updated both at once and Kibana upgraded significantly faster than Elasticsearch, causing an issue.
Differences between the design proposal above and the actual implementation:
associationConf annotations instead of introducing a new set of annotations
phase: DelayingChanges reported in the status could be implemented in a follow-up