Cloud-on-k8s: Do not try to exclude a master node that never existed

Created on 14 Oct 2019 · 9 Comments · Source: elastic/cloud-on-k8s

I ran out of resources on a K8S cluster while doing an upscale of a set of MDI nodes.

I'm now in a situation where the nodeSet can't be downscaled because the operator is trying to exclude a master node which has never existed:

pod/es-apm-sample-es-1-0                    1/1     Running   0          33m    10.56.3.18    gke-michael-dev-cluster-default-pool-8a982915-3msd   <none>           <none>
pod/es-apm-sample-es-1-1                    0/1     Pending   0          26m    <none>        <none>
2019-10-14T07:30:50.277Z    ERROR   controller-runtime.controller   Reconciler error
{
    "ver": "1.0.0-beta1-bc11-c8bb5e5b",
    "controller": "elasticsearch-controller",
    "request": "default/es-apm-sample",
    "error": "unable to add to voting_config_exclusions: 400 Bad Request: add voting config exclusions request for [es-apm-sample-es-1-1] matched no master-eligible nodes",
    "errorCauses": [{
        "error": "unable to add to voting_config_exclusions: 400 Bad Request: unknown",
        "errorVerbose": "400 Bad Request: unknown
unable to add to voting_config_exclusions
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*clientV7).AddVotingConfigExclusions\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/v7.go:41\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/version/zen2.AddToVotingConfigExclusions\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/version/zen2/voting_exclusions.go:34\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.updateZenSettingsForDownscale\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/downscale.go:237\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.doDownscale\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/downscale.go:198\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.attemptDownscale\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/downscale.go:129\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.HandleDownscale\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/downscale.go:54\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).reconcileNodeSpecs\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/nodes.go:112\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/driver.go:234\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).internalReconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:284\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:219\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"
    }]
}
>bug

All 9 comments

This is an interesting one.

Some ideas:

  1. Ignore the error when it happens and move on with removing the master node.
  2. Check if the master node is part of the cluster before we add it to voting_config_exclusions. Don't make the voting_config_exclusions call if the node is not part of the cluster.

In both cases, there's a race condition:

  • we don't add the master node to voting_config_exclusions (either from 1. or 2.)
  • the node joins the cluster <-- the operator does not notice yet
  • we remove the node, but it was not excluded from voting
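Idea 2 can be sketched as a simple membership filter. This is only an illustration, not the operator's code (which is written in Go): the function name `master_nodes_to_exclude` and the `members` snapshot are hypothetical, and the snapshot is assumed to come from some earlier query of the cluster's nodes.

```python
from typing import Iterable, List, Set


def master_nodes_to_exclude(leaving: Iterable[str], cluster_members: Set[str]) -> List[str]:
    """Return only the leaving master nodes that are currently cluster members.

    Names absent from the cluster are skipped, so no exclusion request is
    sent for a node that never joined (for example a Pod stuck in Pending).
    """
    return [name for name in leaving if name in cluster_members]


# Hypothetical snapshot of the node names currently in the cluster.
members = {"es-apm-sample-es-1-0"}
leaving = ["es-apm-sample-es-1-1"]

to_exclude = master_nodes_to_exclude(leaving, members)
if to_exclude:
    print("exclude from voting:", to_exclude)
else:
    # Nothing to exclude. The race described above remains: the node could
    # still join the cluster between this check and its removal.
    print("no voting exclusion call needed")
```

The check is only as fresh as the snapshot it reads, which is exactly why the race condition in the list above still applies.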

The current code retries over and over again, hitting the same error until the node finally joins the cluster. But that may never happen if the node stays Pending or keeps boot-looping forever.
We can detect a Pending or boot-looping node, but we would still end up with the same race condition as above.

Note that we remove only one master node at a time, which mitigates the risk introduced by the race condition above.

I think we have the same sort of problem when setting allocation excludes in cluster settings, to migrate shards away from a data node before removing it. We have an easy way out though: it is possible to exclude a node that is not part of the cluster. The corresponding HTTP call does not fail.
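For comparison, here is a rough sketch of the request body for such an allocation exclude, sent with `PUT /_cluster/settings`. The setting `cluster.routing.allocation.exclude._name` is the standard name-based allocation filter. Using the `transient` scope and the helper name `allocation_exclude_body` are assumptions made for illustration, not a description of what the operator actually sends.

```python
import json
from typing import Iterable


def allocation_exclude_body(node_names: Iterable[str]) -> str:
    """Build a cluster settings body that excludes nodes from shard allocation by name."""
    settings = {
        # Assumption: a transient setting is used; a persistent one would work the same way.
        "transient": {
            "cluster.routing.allocation.exclude._name": ",".join(node_names)
        }
    }
    return json.dumps(settings)


# The name does not have to match a node that is currently in the cluster.
print(allocation_exclude_body(["es-apm-sample-es-1-1"]))
```

Because the filter is matched by name whenever nodes are present, the request succeeds even if no such node exists yet, which is the "easy way out" mentioned above.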

@ywelsch @DaveCTurner I would appreciate your thoughts on this.

Ugh yes this is tricky.

Unfortunately it's necessary to know the node ID (not just its name) before we can exclude it from the voting configuration. If it's not in the cluster we don't know its node ID so we cannot exclude it, hence the exception.

Naively, if a node is not running then you don't need to play with the voting configuration to get rid of it safely. If the cluster is alive then the node in question wasn't needed for its votes, and if the cluster is dead then it's already too late. The main thing that worries me is that this node is still showing as Pending which suggests to me that it might come to life at some point in the future. If we knew it would certainly never start then life would be easier. Is that possible?

Unfortunately, "will not run in future" isn't quite enough. Nodes that are not running cannot join a cluster, but they could remain in a cluster for a short while after their deaths. I think that after stopping the node from running we need to ensure it is certainly out of the cluster. I don't think we provide an API to do this today.

I wonder if we should strengthen the voting config exclusions API to accept an unknown node name.

Ok the change to Elasticsearch is now merged to master and 7.x: We have replaced POST /_cluster/voting_config_exclusions/... with POST /_cluster/voting_config_exclusions?node_names=.... The existing API will be supported throughout the rest of 7.x but will result in deprecation warnings when used in ≥7.8.0.

It will shortly be removed in master but I will hold off on doing that for at least a week from now to give you some time to adapt to the new API without breaking your master builds.

Thanks for the heads up @DaveCTurner!

I suggest we keep this issue open for pre-8.0 clusters (we may decide to do nothing about it though).
And create a new one to track the necessary changes for 8.0.0: https://github.com/elastic/cloud-on-k8s/issues/2951.

I just realized that thanks to https://github.com/elastic/elasticsearch/pull/50836 we could already fix this for Elasticsearch 7.8+, by changing our call from /_cluster/voting_config_exclusions/node1,node2 to /_cluster/voting_config_exclusions?node_names=node1,node2, which should properly ignore non-existing nodes. @DaveCTurner pointed this out already in his comment above, not sure how we missed it 😞.
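A minimal sketch of that version switch, assuming the Elasticsearch version is available as a plain `major.minor.patch` string. The helper names `parse_version` and `voting_exclusions_path` are hypothetical, and the real client is written in Go.

```python
from typing import Iterable, Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse a plain "major.minor.patch" string (no suffix handling)."""
    major, minor, patch = (int(part) for part in version.split(".")[:3])
    return major, minor, patch


def voting_exclusions_path(es_version: str, node_names: Iterable[str]) -> str:
    """Return the path for the POST request that adds voting config exclusions."""
    names = ",".join(node_names)
    if parse_version(es_version) >= (7, 8, 0):
        # Query-parameter form, available from 7.8.0.
        return f"/_cluster/voting_config_exclusions?node_names={names}"
    # Path-segment form, deprecated from 7.8.0 onwards.
    return f"/_cluster/voting_config_exclusions/{names}"


print(voting_exclusions_path("7.8.0", ["node1", "node2"]))
print(voting_exclusions_path("7.7.1", ["node1", "node2"]))
```

Clusters older than 7.8 keep the path-segment call, so they still hit the original error and need the workaround discussed in this thread.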

Raising priority on this issue.

We have https://github.com/elastic/cloud-on-k8s/issues/2951 for the more focused fix of using the new query parameter

To work around this situation when running Elasticsearch < 7.8, you can edit the StatefulSet and manually scale down the number of replicas:

  1. Find the StatefulSet:
> kubectl get sts -l elasticsearch.k8s.elastic.co/cluster-name=<cluster-name>
NAME                       READY   AGE
<cluster-name>-es-<nodeset>   m/n     44h
  2. Adjust the number of replicas:
> kubectl scale --replicas=m sts/<cluster-name>-es-<nodeset>
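The two steps can be tied together by reading `m` from the READY column. The sketch below only builds the command and does not run it. The helper name `scale_command` is hypothetical, and the StatefulSet name in the usage line is inferred from the Pod names in the original report.

```python
from typing import List


def scale_command(statefulset: str, ready: str) -> List[str]:
    """Build the kubectl command that scales a StatefulSet down to its ready replica count.

    `ready` is the READY column from `kubectl get sts`, formatted as "m/n".
    """
    ready_count, _, desired = ready.partition("/")
    m, n = int(ready_count), int(desired)
    if m >= n:
        raise ValueError("all replicas are ready; nothing to scale down")
    return ["kubectl", "scale", f"--replicas={m}", f"sts/{statefulset}"]


# One Pod running and one Pending, as in the report at the top of this issue.
print(" ".join(scale_command("es-apm-sample-es-1", "1/2")))
```

Returning the argument list rather than executing it leaves room to review the command before applying it to a live cluster.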

So are we going to add this workaround to our troubleshooting docs for <7.8 and close this issue?
