Cloud-on-k8s: Implement full cluster restart

Created on 26 Feb 2019  路  12Comments  路  Source: elastic/cloud-on-k8s

Related to https://github.com/elastic/k8s-operators/issues/33

Use case:

  • enable disable TLS #453
  • last resort to abort ES process spiraling out of control (?)

Implementation options:

  • Just destroy the pod and recreate

    • Problem: depending on storage class we might not be able to recreate the pod where the volume resides. Only recovery at this point is manual restore from snapshot

  • Inject a process manager into the standard Elasticsearch container/image

    • Would allow us to restart without recreating the pod unless we need to change the image at the same time (e.g. on major version upgrade) in which case the above applies

    • Has the disadvantage of being fairly intrusive and complex (copying binaries via initcontainers, overriding command etc)

  • Custom image with process manager

    • Has the same properties as above just that the complexity of injecting a process manager is replaced by the pain of releasing and maintaining custom images

  • Use a liveness probe to make k8s restart the container

    • hard to coordinate across an ES cluster

    • scheduling of restart is somewhat delayed (maybe not a big issue)

    • PID 1 zombie reaping problem

  • Recreate cluster and restore from snapshot

    • slow

    • needs a repo

  • Docker exec

    • either user or automated

    • PID 1 zombie reaping problem

Most helpful comment

Thinking about this for the past few days, I ended up writing a design proposal: https://github.com/elastic/k8s-operators/pull/551.

Compared to what we discussed above:

  • it does not require any additional expectations to be set up
  • it fixes some cache inconsistencies by relying on a state machine to handle restart
  • state machine is persisted through pod annotations
  • it also considers pod reuse for minor configuration change, rolling restarts and single-pod restarts (that we don't really need so far)

All 12 comments

485 picks up on the injected process manager idea and most discussion seems to happen there

Some thoughts about how to implement configuration changes through secret volume mounts (and how that relates to full cluster restart).
I'm trying to figure out a way to reuse pods (restart es process with a different configuration) for particular configuration changes.
Full cluster restart being a special case of this.
What's not easy here is to consider that we could return early in every step of the way, and should still be able to eventually reconcile at next reconciliations. In other terms: there is no atomicity in operations we do. And we have the k8s client cache inconsistencies to keep in mind.

Algorithm:

1. Compare actual vs. expected => differences for (some|all) pods
2. There are PodSpecs to create, and pods to delete
3. Try to find a match between pods to create and pods to delete. eg. CanReuse(currentPod, forExpectedSpec). This needs to be deterministic among multiple reconciliation iterations since the reconciliation could return early in between any step here.
4. For each pod we can reuse:
    a. evacuate it properly (set allocation excludes)
    b. wait until all shards are migrated
    c. call process manager's `POST /elasticsearch/stop` (needs to be an idempotent call)
    d. wait for stopped by checking result of `GET /elasticsearch/status` (reported by the es observer?)
    e. modify the configuration secret volume
    f. wait til configuration secret is propagated (`GET /elasticsearch/status` should also report a checksum of `elasticsearch.yaml` ?)
    g. start elasticsearch (`POST /elasticsearch/start`)
5. Restore allocation excludes = true for all pods that should be started

Full cluster restart when changing license is a special case of 4. that can be detected when all existing pods need to be deleted AND the changes include a particular license change (basic -> trial|gold|platinum or the other way around). In such case, skip the shard migration wait and go straight to step 4c.

One thing to resolve: once the configuration is changed through the secret volume mount, the new one will be used for comparison... but the old one is still used by the ES process. We should compare with the "old" configuration, not the new one that is not yet applied. How do we make steps 4e to 4g "atomic"? ie. if we exit the reconciliation loop after 4e, we still want to reach 4f at subsequent iterations. To enter that path, step1. still needs to compare the "old" configuration.

  1. For each pod we can reuse:
    a. evacuate it properly (set allocation excludes)

I am not sure I follow why we have to evacuate the pod. Would that not defeat the purpose of reusing the pod? Not sure what kind of use cases you see for inline configuration changes that actually reuse the pod and need evacuation.

Just trying to sum up my current understanding (which might be lacking :-): any pod reuse related to auth config changes should work without a full cluster restart and also without evacuation. The only change I can think of atm that would require evacuation is a node type change (and maybe a plugin change). If we want to be able to reconfigure that without allocating a new pod then 馃憤

Side note: any spec change that touches resources prevents pod reuse I believe, correct?

Something that is maybe still missing in the algorithm is coordination across all nodes in a cluster. If I understand how restarts work, then we must not restart Elasticsearch with the new settings (e.g. TLS on/off) while other pods are still running with the old settings (not sure about that one actually needs verification)

I think we also might want to consider a synced flush to speed up shard recovery after the restart https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-synced-flush.html

And maybe disabling shard allocation before shutting down any nodes to avoid unnecessary IO created by Elasticsearch trying to replicate the shards that just went away on those nodes.

We should compare with the "old" configuration, not the new one that is not yet applied. How do we make steps 4e to 4g "atomic"? ie. if we exit the reconciliation loop after 4e, we still want to reach 4f at subsequent iterations

Coudl we use a variation of the expectation mechanism for that? Build expectations around API interactions with the process manager to wait for a condition (checksum reported back, cluster restarted)

  • About evacuating pods:

You're right I think I was planning a step ahead here, to indeed reuse pods for other reasons (node type, plugin, custom user settings...). Anything that needs a configuration change and an ES restart, for which it would be nice to just reuse the same pod because data is already there. But this could also be achieved through reusing persistent volumes (#312), which would be more k8s-like.

I think it's safe in the short-term to ignore this feature and focus on the full-restart only (no evacuation).

  • Resource change

Side note: any spec change that touches resources prevents pod reuse I believe, correct?

I think you're right. Related k8s issue (open and debated): https://github.com/kubernetes/kubernetes/issues/5774

  • Coordinated restart

Indeed this is not reflected well in the algorithm above. I guess step 4 needs to be entirely replaced for a full coordinated restart.

4. For all existing pods to be reused
    a. stop them all
    b. wait for all of them to be stopped
    c. modify configuration secret volume
    d. wait for configuration to be propagated
    e. start them all
  • Reusing the expectations mechanism to avoid configuration change mismatch

I'm trying to think about that one but so far I cannot see a clean way to make it resilient to multiple reconciliation iterations. Still thinking about it. Let's try to draft a high-level algorithm of how this would work?

One thing that _could_ work: make the process manager automatically start (or restart) the ES process whenever there is a change in the configuration file content on disk (ie. when a new configuration secret has been propagated to the pod).

Then it would be up to the operator to modify the configuration secret volume only when it knows it is safe for the ES process to be started or restarted.

That would remove step d and e from the algorithm in the comment above:

4. For all existing pods to be reused
    a. stop them all
    b. wait for all of them to be stopped
    c. modify configuration secret volume (process-manager will restart ES automatically)
    ~REMOVED: d. wait for configuration to be propagated~
    ~REMOVED: e. start them all~

Let's say we returned after step c. In a subsequent reconciliation iteration:

  • Either the configuration secret in our client cache is up-to-date, meaning our pod comparison won't return any change to perform.
  • Either it's out of date and all pods have the "old" configuration: we'll go through the algorithm again and attempt the same operation (idempotent stop operation, updating the same secret with the same value, which will have no effect and/or result in a rejected update)
  • Either only part of the pods configuration secret is out of date. In this situation we want to avoid doing a full cluster restart again (waiting for all nodes to be stopped, since some of them might be restarting already). In such case, go through the algorithm again, but in a non-cluster coordinated mode (ie. pod by pod). How do we know to do non-coordinated? Because our comparison check did not report all pods to be deleted/reused.

That implicit cluster restart makes me slightly nervous.

I was thinking maybe we could track the number of restarts in the process manager and in the Elasticsearch status maybe observedRestarts: int, expectedRestarts: int and then once you observe a configuration change that needs a restart, you would increase the expected restarts in the status field, before you update the secret volume, and then compare the status of the process manager with that expected value.

In case of an early return

  • before the status has been updated: it would retry updating the status
  • after the status has been updated: it would update the secret volume
  • after the secret has been updated:

    • it would wait until the process manager has seen the secret update using some revision number

    • it would observe the difference in status reported by the process manager and Elasticsearch status attribute and wait or issue the restart command if the process manager status indicates it has not received the command yet

Interesting! Trying to understand it fully and dig into corner cases. Cold you try to outline a step-by-step algo similar to https://github.com/elastic/k8s-operators/issues/454#issuecomment-473269276? I think it helps to visualize how to recover between each step if the reconciliation loop restarts.

I was thinking maybe we could track the number of restarts in the process manager and in the Elasticsearch status maybe observedRestarts: int, expectedRestarts: int

Where would we store that?

In the operator runtime "memory": value lost after reboot.
In the elasticsearch resource in the apiserver: inconsistent value in the cache.
In the process manager: inconsistent value as seen from the operator if we read it async from the es observer.

before the status has been updated: it would retry updating the status

could we end up in a situation where eg. observedRestarts: 1, expectedRestarts: 15?

What if the es container restarts for a random reason here? observedRestarts: 0, expectedRestarts: 15?

after the secret has been updated:
it would wait until the process manager has seen the secret update using some revision number

Where would that revision number be stored? We cannot compute it dynamically from the secret content itself since the secret content we get at the beginning of an iteration might be out of date.

I guess the check would need to happen before any comparison between actual and expected pods? In other terms: we short-circuit any es reconciliation if there is one (actual) pod for which the secret content we see from the operator does not match the secret content seen from querying the process manager? If the pod is still in the init container phase, we probably need to skip this check and requeue.

issue the restart command

I don't know if it's safe to use a single "restart" command instead of "stop" and "start". What if the operator dies after issuing a restart for 3/5 pods?

issue the restart command if the process manager status indicates it has not received the command yet

How do we get into this path on a fresh reconciliation loop with the secret up to date. Should we check this at the beginning of the reconciliation (before any pod reconciliation)?

I will try to add a step-by-step thingy a bit later. Just trying to answer a few of your questions:

Where would we store that?

In the operator runtime "memory": value lost after reboot.
In the elasticsearch resource in the apiserver: inconsistent value in the cache.
In the process manager: inconsistent value as seen from the operator if we read it async from the es observer.

I think it needs to be in both places: the process manager should track the number of restarts locally. The operator needs to track them as well. We could store them in status of a CRD (consistent after restarts due to cache resync) _and_ in runtime state (consistent during non-interrupted operation)

could we end up in a situation where eg. observedRestarts: 1, expectedRestarts: 15?

We would not request another restart until both values equal out though, right?

Where would that revision number be stored? We cannot compute it dynamically from the secret content itself since the secret content we get at the beginning of an iteration might be out of date.

Could be another field in the secret that just replicates the k8s revision visible to the process manager so that it can return it in its status response. We would know in the operator that we need wait for a certain revision to become visible in the pod that is greater or equal to the one we write in to the secret and can set up an expectation for that.

I don't know if it's safe to use a single "restart" command instead of "stop" and "start". What if the operator dies after issuing a restart for 3/5 pods?

Yes absolutely, also for other reasons. I think that separating stop and start is essential to able to ensure the pod 'sees' all the necessary configuration before we start it up again.

How do we get into this path on a fresh reconciliation loop with the secret up to date. Should we check this at the beginning of the reconciliation (before any pod reconciliation)?

Yes so not sure about the implementation details, but the important bit here is that we see a difference between observed state and desired state, right? In this case assuming the secret would be already up to date we would (eventually) observe that the Elasticsearch process is not in the right state either because it is still stopped or because we are missing a restart in our comparison of restart counts.

Thinking about this for the past few days, I ended up writing a design proposal: https://github.com/elastic/k8s-operators/pull/551.

Compared to what we discussed above:

  • it does not require any additional expectations to be set up
  • it fixes some cache inconsistencies by relying on a state machine to handle restart
  • state machine is persisted through pod annotations
  • it also considers pod reuse for minor configuration change, rolling restarts and single-pod restarts (that we don't really need so far)

Putting this on pause until we actually need it.
PR #604 contains a working implementation that we can copy or get inspiration from when the times comes :)

We decided we would revive some parts of PR #604, stripped out of any license-specific/TLS bit.
The goal here would be to handle full cluster restart on-demand (not dynamically decided by the operator). The longer-term goal is that the initial pod reuse implementation can serve as a basis to implement pod & pvc reuse during cluster mutations.

Was this page helpful?
0 / 5 - 0 ratings