Cloud-on-k8s: Improve zen1 support with StatefulSets

Created on 18 Jul 2019  路  19Comments  路  Source: elastic/cloud-on-k8s

How should we deal with zen1 in ECK reconciliation loops?

I have in mind the following, open for discussion:

Reconciliation:
---------------

We go through the following algorithm over and over again,
everytime something changes in k8s resources
(eg. a pod was created/deleted based on what we scheduled).

/!\ We don't explicitly create/delete nodes: we schedule them for
async creation/deletion (handled by k8s).

---------------

## 1. Retrieve state of the world

- get the expected nodes we should have
- get the nodes we currently have

## 2.  Create missing nodes

- create `elasticsearch.yml` for each missing node,
   with `minimum_master_nodes=(expected_master_nodes/2)+1`
- `expected_master_nodes` here does not represent the current
   state of resources, but the one we expect once all nodes are there
   (eg. if we add 2 new masters to 2 existing ones, the 2 new ones will have m_m_n=3)
- schedule missing nodes for creation (with their configuration)
- modify all existing nodes configuration to consider the updated
  `minimum_master_nodes`, in case they restart

## 3. Update cluster settings API

/!\ Important: at that time the nodes we scheduled for creation above
may not be created yet

- Update /_cluster/settings (both persistent and transient)
   with `minimum_master_nodes=(current_masters/2)+1`

`current_masters` here is ignoring any new node from step 2
that may not have been created yet.
Once these new nodes are actually created, we'll go through
the entire algorithm again.

## 4. Delete nodes

- Only delete one master node at a time.
- If node(s) to delete is a master:
    - special case if going from 2 masters to 1 master (unsafe):
         update /_cluster/settings with `minimum_master_nodes=1`
    - else (normal case): don't update /_cluster/settings yet
    - schedule the master node for deletion

Nodes are not effectively deleted here (async call),
but we'll go through the entire algorithm again when
the deletion actually happened, and update `/_cluster/settings` in step 3.

Thoughts would be appreciated @ywelsch @DaveCTurner :)

justdoit

Most helpful comment

@DaveCTurner is on vacation (for the next 2 weeks) so I'm answering here.

Zen2 nodes always prefer to elect Zen1 nodes as master as long as there's a master-eligible Zen1 node in the cluster.

Voting configurations only make sense once the cluster has been Zen2-bootstrapped. In a rolling-upgrade, this bootstrapping automatically happens once the Zen2 nodes notice that all Zen1 master-eligible node have left the cluster. Only after bootstrapping (which establishes the first voting configuration) can a Zen2 node become master.

This means that in a rolling upgrade with mixed Zen1/Zen2 master-eligible nodes, as long as there is a Zen1 master, minimum_master_nodes should be adjusted, as there are no voting configurations (and therefore also no exclusions).

If you're expanding a 3 node cluster to 5 nodes during a 6.x to 7.x upgrade with a grow-and-shrink, it should look as follows:

  • initially have 3 6.x nodes, min_master_nodes set to 2
  • set min_master_nodes to 3, add 7.x node (3 is a majority of 4 nodes)
  • set min_master_nodes to 3, add 7.x node (3 is a majority of 5 nodes)
  • set min_master_nodes to 4, add 7.x node (4 is a majority of 6 nodes)
  • set min_master_nodes to 4, add 7.x node (4 is a majority of 7 nodes)
  • set min_master_nodes to 5, add 7.x node (5 is a majority of 8 nodes)
  • remove 6.x node (master will still be a 6.x node), set min_master_nodes to 4 (4 is a majority of 7 nodes)
  • remove 6.x node (master will still be a 6.x node), set min_master_nodes to 4 (4 is a majority of 6 nodes)
  • remove 6.x node (master will still be a 6.x node). This is the last Zen1 node. The Zen2 nodes will at that point auto-bootstrap with an initial voting configuration containing the 5 Zen nodes, and a Zen2 master will be elected. There is no need to adjust min_master_nodes (it can be set to null).

Is it safe to exclude a node from voting config exclusions if the current master node is a v6 node?

It's nonsensical because there are no voting configurations at that point.

Is it safe to update minimum_master_nodes through the API if the current master node is a v7 node?

It's non-sensical to do because once you have a Zen2 master, only voting configurations matter.

All 19 comments

A suggestion from @ywelsch on Slack:

The only perhaps a bit more controversial one is step 2, where the existing running masters are only made aware of the changes post-hoc, leaving a period where the minimum_master_nodes cluster setting is less than (total_master_nodes/2)+1. (edited)
One solution there would be to only add one master at a time (with the only exception during cluster bootstrapping perhaps)

Following https://github.com/elastic/cloud-on-k8s/pull/1262, I think we need to:

  • update m_m_n before deleting a master node in the case 2->1
  • probably make sure we add only one master node at a time (except when bootstrapping the cluster)
  • probably make sure we add only one master node at a time (except when bootstrapping the cluster)

Actually maybe that there is small improvement, if the current number n of master is odd you may be able to add 2 new masters at a time. Reason is that n+1 and n+2 masters share the same m_m_n

For example: if you have 3 nodes and user want to move to 5 _(or more)_ nodes then you can directly add the 2 new nodes because the new m_m_n is still 3 whatever there are 4 or 5 masters.

Actually maybe that there is small improvement, if the current number n of master is odd you may be able to add 2 new masters at a time. Reason is that n+1 and n+2 masters share the same m_m_n

True.
Not sure that's worth handling explicitly though, it would add a bit of complexity in the code for a rather small optimisation?

@DaveCTurner what's the recommended approach when dealing with a mixed v6/v7 cluster?

Say I'm moving from 3 v6.x nodes to 5 v7.x nodes.
Is it safe to update minimum_master_nodes through the API if the current master node is a v7 node?
Is it safe to exclude a node from voting config exclusions if the current master node is a v6 node?

One way of removing this problem is maybe to make sure we don't perform (up|down)scales and rolling upgrades at the same time.

Is it safe to exclude a node from voting config exclusions if the current master node is a v6 node?

This will fail because the 6.x master node does not have the 7.x voting exclusion API

@DaveCTurner is on vacation (for the next 2 weeks) so I'm answering here.

Zen2 nodes always prefer to elect Zen1 nodes as master as long as there's a master-eligible Zen1 node in the cluster.

Voting configurations only make sense once the cluster has been Zen2-bootstrapped. In a rolling-upgrade, this bootstrapping automatically happens once the Zen2 nodes notice that all Zen1 master-eligible node have left the cluster. Only after bootstrapping (which establishes the first voting configuration) can a Zen2 node become master.

This means that in a rolling upgrade with mixed Zen1/Zen2 master-eligible nodes, as long as there is a Zen1 master, minimum_master_nodes should be adjusted, as there are no voting configurations (and therefore also no exclusions).

If you're expanding a 3 node cluster to 5 nodes during a 6.x to 7.x upgrade with a grow-and-shrink, it should look as follows:

  • initially have 3 6.x nodes, min_master_nodes set to 2
  • set min_master_nodes to 3, add 7.x node (3 is a majority of 4 nodes)
  • set min_master_nodes to 3, add 7.x node (3 is a majority of 5 nodes)
  • set min_master_nodes to 4, add 7.x node (4 is a majority of 6 nodes)
  • set min_master_nodes to 4, add 7.x node (4 is a majority of 7 nodes)
  • set min_master_nodes to 5, add 7.x node (5 is a majority of 8 nodes)
  • remove 6.x node (master will still be a 6.x node), set min_master_nodes to 4 (4 is a majority of 7 nodes)
  • remove 6.x node (master will still be a 6.x node), set min_master_nodes to 4 (4 is a majority of 6 nodes)
  • remove 6.x node (master will still be a 6.x node). This is the last Zen1 node. The Zen2 nodes will at that point auto-bootstrap with an initial voting configuration containing the 5 Zen nodes, and a Zen2 master will be elected. There is no need to adjust min_master_nodes (it can be set to null).

Is it safe to exclude a node from voting config exclusions if the current master node is a v6 node?

It's nonsensical because there are no voting configurations at that point.

Is it safe to update minimum_master_nodes through the API if the current master node is a v7 node?

It's non-sensical to do because once you have a Zen2 master, only voting configurations matter.

Thanks @ywelsch!
IIUC, a simple way to handle this in our code when we do some changes on the cluster:

  • if there is at least one v6 master node running, go the zen1 way (update minimum_master_nodes)
  • else (no v6 master node), go the zen2 way (voting config exclusions)

yes, makes sense.

@ywelsch one tricky bit raised in the PR for zen1 2->1 master special case is about availability of the master nodes. We can be in the following situation:

  • 3 master nodes cluster
  • 2 out of the 3 master nodes are actually running. One is unavailable for whatever reason (was just created but is not running yet, not enough resource on k8s, host is down, network issue, etc.)
  • The user wants to downscale from 3 to 2 master nodes, effectively removing one of the 2 running masters (this might be part of a larger 3->1 downscale).

Even though we downscale, in theory, from 3 to 2 nodes as seen from K8S, we actually downscale effectively from 2 to 1 nodes as seen from ES. If we decide to update zen1 minimum_master_nodes=1 before removing the node, but the third node comes back to life before the node is removed, we are temporarily running 3 masters with minimum_master_nodes=1.

This is made a bit harder since ECK is working with out-of-date nodes state (can see one node unavailable that would, in real-time reality (馃槃), be available).

We could also argue that the master to remove should be the unavailable one, but the user is allowed to express master nodes specification in different group of nodes (different config, availability zones, etc.) so may request explicitly the removal of one of the 2 running masters.

Any thoughts?

As you noticed, this type of down-scaling is completely unsafe, and the quickest way to get data loss. As @pebrc commented on the PR, you can't just take the currently available nodes into account to determine minimum_master_nodes (same applies to all other cases, not only those outlined here). I think that the operator should fail the down-scaling process in the above case.

We could decide to always make sure the expected masters are running before updating minimum_master_nodes through the ES API (not more from removals or rolling upgrades not completed yet, not less from creations not completed yet).
We would requeue any downscale operation until we reach this stable state.

I'm worried about cases where the user would be stuck with masters that will never go running. For example:

  • masters that were misconfigured (eg. unknown es setting)
  • master that cannot be scheduled on a k8s node (wrong affinity or no more resources available)

If we refuse to perform any downscale at this point, the user cannot remove those misconfigured StatefulSets?

If we refuse to perform any downscale at this point, the user cannot remove those misconfigured StatefulSets?

Could we detect some instances where such a node would be OK to remove. For example a bootlooping node (due to misconfiguration) would be detectable. I know it adds complexity but it might help us getting out of that situation.

We could decide to always make sure the expected masters are running before updating minimum_master_nodes through the ES API

馃憤 this makes sense to me. We know how many we expect to be running and can check how many are actually members of the cluster.

Personally I'm not too worried about handling a boot looping pod within the operator right now, that definitely seems like it can be an optimization for later. For now some manual work seems acceptable. We might need to increase the troubleshooting documentation as we run into more cases like that and then can think about automating them.

I think it's safe to close this now.
I created https://github.com/elastic/cloud-on-k8s/issues/1605 to focus on the remaining task of adding only one master node at a time, and how to deal with unavailable masters.

Zen2 nodes always prefer to elect Zen1 nodes as master as long as there's a master-eligible Zen1 node in the cluster.

This is not quite true. It should be:

Zen2 nodes always prefer to elect Zen1 nodes as master as long as there's a master-eligible Zen1 node in the cluster and they have not already bootstrapped a Zen2 cluster.

The normal flow for a rolling upgrade from 6.x to 7.x is as Yannick describes, but there is a failure case that isn't covered. If more than half of the master-eligible nodes are upgraded to 7.x and then the remaining 6.x master-eligible nodes are temporarily disconnected then the 7.x nodes will assume the 6.x master-eligible nodes have been shut down and will automatically bootstrap a 7.x cluster. When the 6.x nodes reconnect they will discover the 7.x cluster and will join it. You will now be in a state where the process described in this comment doesn't do the right thing, because it will be assuming this is still a 6.x cluster and manipulating minimum_master_nodes.

As a more concrete example, imagine you are upgrading a 3-master-node cluster and get to the point of having two 7.x master nodes and one 6.x master node. If the 6.x master node is briefly disconnected then the 7.x master nodes will bootstrap a cluster and elect a master, and the voting configuration will be the elected master alone. If you now shut down the elected master without using a voting config exclusion then the cluster will fall apart, even if the other two master nodes are fine.

I raise this because I'm not sure what else might happen mid-upgrade. If an upgrade gets stuck for some reason and you need to restart a 7.x node to fix it then you may need to handle the cluster being unavailable for a while.

There's also a mechanism for preventing 6.x nodes from even joining a cluster that comprises entirely 7.x nodes, which might happen if the 6.x nodes are all temporarily disconnected.

@DaveCTurner thanks for the details.
What's the best way to know if a zen2 cluster has been bootstrapped?

If the cluster state has a nonempty voting configuration then it has definitely bootstrapped. However if you observe an empty voting configuration then we don't know: bootstrapping can technically happen at any time, so you may be looking at stale information.

The best way to handle this is, I think, to make sure that you can still escape from this kind of intermediate state by pushing forwards with the upgrade despite failures.

Closing this big issue in favor of keeping https://github.com/elastic/cloud-on-k8s/issues/1792 open to fix the corner case we discovered in the last few posts.

Other than that, everything looks fine with zen1 so far. We'll open other issues if we discover new bugs/corner cases.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

Pandoraemon picture Pandoraemon  路  5Comments

sebgl picture sebgl  路  3Comments

SebastianCaceresUltra picture SebastianCaceresUltra  路  3Comments

sebgl picture sebgl  路  3Comments

djschny picture djschny  路  4Comments