Related to https://github.com/elastic/k8s-operators/issues/33
Use case:
Implementation options:
Some thoughts about how to implement configuration changes through secret volume mounts (and how that relates to full cluster restart).
I'm trying to figure out a way to reuse pods (restart es process with a different configuration) for particular configuration changes.
Full cluster restart being a special case of this.
What's not easy here is to consider that we could return early in every step of the way, and should still be able to eventually reconcile at next reconciliations. In other terms: there is no atomicity in operations we do. And we have the k8s client cache inconsistencies to keep in mind.
Algorithm:
1. Compare actual vs. expected => differences for (some|all) pods
2. There are PodSpecs to create, and pods to delete
3. Try to find a match between pods to create and pods to delete. eg. CanReuse(currentPod, forExpectedSpec). This needs to be deterministic among multiple reconciliation iterations since the reconciliation could return early in between any step here.
4. For each pod we can reuse:
a. evacuate it properly (set allocation excludes)
b. wait until all shards are migrated
c. call process manager's `POST /elasticsearch/stop` (needs to be an idempotent call)
d. wait for stopped by checking result of `GET /elasticsearch/status` (reported by the es observer?)
e. modify the configuration secret volume
f. wait til configuration secret is propagated (`GET /elasticsearch/status` should also report a checksum of `elasticsearch.yaml` ?)
g. start elasticsearch (`POST /elasticsearch/start`)
5. Restore allocation excludes = true for all pods that should be started
Full cluster restart when changing license is a special case of 4. that can be detected when all existing pods need to be deleted AND the changes include a particular license change (basic -> trial|gold|platinum or the other way around). In such case, skip the shard migration wait and go straight to step 4c.
One thing to resolve: once the configuration is changed through the secret volume mount, the new one will be used for comparison... but the old one is still used by the ES process. We should compare with the "old" configuration, not the new one that is not yet applied. How do we make steps 4e to 4g "atomic"? ie. if we exit the reconciliation loop after 4e, we still want to reach 4f at subsequent iterations. To enter that path, step1. still needs to compare the "old" configuration.
- For each pod we can reuse:
a. evacuate it properly (set allocation excludes)
I am not sure I follow why we have to evacuate the pod. Would that not defeat the purpose of reusing the pod? Not sure what kind of use cases you see for inline configuration changes that actually reuse the pod and need evacuation.
Just trying to sum up my current understanding (which might be lacking :-): any pod reuse related to auth config changes should work without a full cluster restart and also without evacuation. The only change I can think of atm that would require evacuation is a node type change (and maybe a plugin change). If we want to be able to reconfigure that without allocating a new pod then 馃憤
Side note: any spec change that touches resources prevents pod reuse I believe, correct?
Something that is maybe still missing in the algorithm is coordination across all nodes in a cluster. If I understand how restarts work, then we must not restart Elasticsearch with the new settings (e.g. TLS on/off) while other pods are still running with the old settings (not sure about that one actually needs verification)
I think we also might want to consider a synced flush to speed up shard recovery after the restart https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-synced-flush.html
And maybe disabling shard allocation before shutting down any nodes to avoid unnecessary IO created by Elasticsearch trying to replicate the shards that just went away on those nodes.
We should compare with the "old" configuration, not the new one that is not yet applied. How do we make steps 4e to 4g "atomic"? ie. if we exit the reconciliation loop after 4e, we still want to reach 4f at subsequent iterations
Coudl we use a variation of the expectation mechanism for that? Build expectations around API interactions with the process manager to wait for a condition (checksum reported back, cluster restarted)
You're right I think I was planning a step ahead here, to indeed reuse pods for other reasons (node type, plugin, custom user settings...). Anything that needs a configuration change and an ES restart, for which it would be nice to just reuse the same pod because data is already there. But this could also be achieved through reusing persistent volumes (#312), which would be more k8s-like.
I think it's safe in the short-term to ignore this feature and focus on the full-restart only (no evacuation).
Side note: any spec change that touches resources prevents pod reuse I believe, correct?
I think you're right. Related k8s issue (open and debated): https://github.com/kubernetes/kubernetes/issues/5774
Indeed this is not reflected well in the algorithm above. I guess step 4 needs to be entirely replaced for a full coordinated restart.
4. For all existing pods to be reused
a. stop them all
b. wait for all of them to be stopped
c. modify configuration secret volume
d. wait for configuration to be propagated
e. start them all
I'm trying to think about that one but so far I cannot see a clean way to make it resilient to multiple reconciliation iterations. Still thinking about it. Let's try to draft a high-level algorithm of how this would work?
One thing that _could_ work: make the process manager automatically start (or restart) the ES process whenever there is a change in the configuration file content on disk (ie. when a new configuration secret has been propagated to the pod).
Then it would be up to the operator to modify the configuration secret volume only when it knows it is safe for the ES process to be started or restarted.
That would remove step d and e from the algorithm in the comment above:
4. For all existing pods to be reused
a. stop them all
b. wait for all of them to be stopped
c. modify configuration secret volume (process-manager will restart ES automatically)
~REMOVED: d. wait for configuration to be propagated~
~REMOVED: e. start them all~
Let's say we returned after step c. In a subsequent reconciliation iteration:
That implicit cluster restart makes me slightly nervous.
I was thinking maybe we could track the number of restarts in the process manager and in the Elasticsearch status maybe observedRestarts: int, expectedRestarts: int and then once you observe a configuration change that needs a restart, you would increase the expected restarts in the status field, before you update the secret volume, and then compare the status of the process manager with that expected value.
In case of an early return
Interesting! Trying to understand it fully and dig into corner cases. Cold you try to outline a step-by-step algo similar to https://github.com/elastic/k8s-operators/issues/454#issuecomment-473269276? I think it helps to visualize how to recover between each step if the reconciliation loop restarts.
I was thinking maybe we could track the number of restarts in the process manager and in the Elasticsearch status maybe observedRestarts: int, expectedRestarts: int
Where would we store that?
In the operator runtime "memory": value lost after reboot.
In the elasticsearch resource in the apiserver: inconsistent value in the cache.
In the process manager: inconsistent value as seen from the operator if we read it async from the es observer.
before the status has been updated: it would retry updating the status
could we end up in a situation where eg. observedRestarts: 1, expectedRestarts: 15?
What if the es container restarts for a random reason here? observedRestarts: 0, expectedRestarts: 15?
after the secret has been updated:
it would wait until the process manager has seen the secret update using some revision number
Where would that revision number be stored? We cannot compute it dynamically from the secret content itself since the secret content we get at the beginning of an iteration might be out of date.
I guess the check would need to happen before any comparison between actual and expected pods? In other terms: we short-circuit any es reconciliation if there is one (actual) pod for which the secret content we see from the operator does not match the secret content seen from querying the process manager? If the pod is still in the init container phase, we probably need to skip this check and requeue.
issue the restart command
I don't know if it's safe to use a single "restart" command instead of "stop" and "start". What if the operator dies after issuing a restart for 3/5 pods?
issue the restart command if the process manager status indicates it has not received the command yet
How do we get into this path on a fresh reconciliation loop with the secret up to date. Should we check this at the beginning of the reconciliation (before any pod reconciliation)?
I will try to add a step-by-step thingy a bit later. Just trying to answer a few of your questions:
Where would we store that?
In the operator runtime "memory": value lost after reboot.
In the elasticsearch resource in the apiserver: inconsistent value in the cache.
In the process manager: inconsistent value as seen from the operator if we read it async from the es observer.
I think it needs to be in both places: the process manager should track the number of restarts locally. The operator needs to track them as well. We could store them in status of a CRD (consistent after restarts due to cache resync) _and_ in runtime state (consistent during non-interrupted operation)
could we end up in a situation where eg. observedRestarts: 1, expectedRestarts: 15?
We would not request another restart until both values equal out though, right?
Where would that revision number be stored? We cannot compute it dynamically from the secret content itself since the secret content we get at the beginning of an iteration might be out of date.
Could be another field in the secret that just replicates the k8s revision visible to the process manager so that it can return it in its status response. We would know in the operator that we need wait for a certain revision to become visible in the pod that is greater or equal to the one we write in to the secret and can set up an expectation for that.
I don't know if it's safe to use a single "restart" command instead of "stop" and "start". What if the operator dies after issuing a restart for 3/5 pods?
Yes absolutely, also for other reasons. I think that separating stop and start is essential to able to ensure the pod 'sees' all the necessary configuration before we start it up again.
How do we get into this path on a fresh reconciliation loop with the secret up to date. Should we check this at the beginning of the reconciliation (before any pod reconciliation)?
Yes so not sure about the implementation details, but the important bit here is that we see a difference between observed state and desired state, right? In this case assuming the secret would be already up to date we would (eventually) observe that the Elasticsearch process is not in the right state either because it is still stopped or because we are missing a restart in our comparison of restart counts.
Thinking about this for the past few days, I ended up writing a design proposal: https://github.com/elastic/k8s-operators/pull/551.
Compared to what we discussed above:
This PR is related https://github.com/elastic/k8s-operators/pull/604
Putting this on pause until we actually need it.
PR #604 contains a working implementation that we can copy or get inspiration from when the times comes :)
We decided we would revive some parts of PR #604, stripped out of any license-specific/TLS bit.
The goal here would be to handle full cluster restart on-demand (not dynamically decided by the operator). The longer-term goal is that the initial pod reuse implementation can serve as a basis to implement pod & pvc reuse during cluster mutations.
Most helpful comment
Thinking about this for the past few days, I ended up writing a design proposal: https://github.com/elastic/k8s-operators/pull/551.
Compared to what we discussed above: