I ran out of resources on a K8s cluster while scaling up a set of MDI nodes.
I'm now in a situation where the nodeSet can't be downscaled because the operator is trying to exclude a master node that has never existed:
pod/es-apm-sample-es-1-0 1/1 Running 0 33m 10.56.3.18 gke-michael-dev-cluster-default-pool-8a982915-3msd <none> <none>
pod/es-apm-sample-es-1-1 0/1 Pending 0 26m <none> <none>
2019-10-14T07:30:50.277Z ERROR controller-runtime.controller Reconciler error
{
"ver": "1.0.0-beta1-bc11-c8bb5e5b",
"controller": "elasticsearch-controller",
"request": "default/es-apm-sample",
"error": "unable to add to voting_config_exclusions: 400 Bad Request: add voting config exclusions request for [es-apm-sample-es-1-1] matched no master-eligible nodes",
"errorCauses": [{
"error": "unable to add to voting_config_exclusions: 400 Bad Request: unknown",
"errorVerbose": "400 Bad Request: unknown
unable to add to voting_config_exclusions
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*clientV7).AddVotingConfigExclusions\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/v7.go:41\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/version/zen2.AddToVotingConfigExclusions\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/version/zen2/voting_exclusions.go:34\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.updateZenSettingsForDownscale\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/downscale.go:237\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.doDownscale\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/downscale.go:198\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.attemptDownscale\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/downscale.go:129\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.HandleDownscale\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/downscale.go:54\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).reconcileNodeSpecs\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/nodes.go:112\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/driver.go:234\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).internalReconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:284\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:219\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"
}]
}
This is an interesting one.
Some ideas:
In both cases, there's a race condition:
The current code retries over and over again, hitting the same error until the node finally joins the cluster. But that may never happen if the node stays Pending or keeps boot-looping forever.
We can detect a Pending or boot-looping node, but we would still end up with the same race condition as above.
Note that we only remove one master node at a time, which mitigates the risk introduced by the above race condition.
I think we have the same sort of problem when setting allocation excludes in cluster settings, to migrate shards away from a data node before removing it. We have an easy way out though: it is possible to exclude a node that is not part of the cluster. The corresponding HTTP call does not fail.
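For reference, a minimal sketch of the kind of allocation-exclude call I mean, assuming the standard `cluster.routing.allocation.exclude._name` cluster setting. The helper name, the choice of `transient`, and `ES_URL` are placeholders for illustration, not what the operator code necessarily does.

```shell
# Build the JSON body for PUT /_cluster/settings that asks Elasticsearch to
# move shards away from the given node names (comma-separated).
# allocation_exclude_body is a hypothetical helper name; using "transient"
# rather than "persistent" is an assumption.
allocation_exclude_body() {
  printf '{"transient":{"cluster.routing.allocation.exclude._name":"%s"}}\n' "$1"
}

# Example call (assumes ES_URL and credentials are set up for your cluster):
# curl -s -X PUT "$ES_URL/_cluster/settings" \
#   -H 'Content-Type: application/json' \
#   -d "$(allocation_exclude_body es-apm-sample-es-1-1)"
```

The name in the body is only a filter value, so the request does not depend on that node having joined the cluster.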
@ywelsch @DaveCTurner I would appreciate your thoughts on this.
Ugh yes this is tricky.
Unfortunately it's necessary to know the node ID (not just its name) before we can exclude it from the voting configuration. If it's not in the cluster we don't know its node ID so we cannot exclude it, hence the exception.
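As a rough illustration, a caller could check whether the node has actually joined as a master-eligible node before attempting the exclusion. This sketch assumes input in the shape of `GET /_cat/nodes?h=name,node.role` (one `name roles` pair per line); the function name is made up, and the check is still subject to the race discussed above.

```shell
# Reads "name roles" lines on stdin and succeeds only if the named node is
# present and its role string contains "m" (master-eligible).
# is_master_eligible_member is a hypothetical helper name.
is_master_eligible_member() {
  awk -v n="$1" '$1 == n && $2 ~ /m/ { found = 1 } END { exit (found ? 0 : 1) }'
}

# Example (assumes ES_URL and credentials are set up for your cluster):
# curl -s "$ES_URL/_cat/nodes?h=name,node.role" \
#   | is_master_eligible_member es-apm-sample-es-1-1 \
#   || echo "node has not joined; the exclusion request would fail"
```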
Naively, if a node is not running then you don't need to play with the voting configuration to get rid of it safely. If the cluster is alive then the node in question wasn't needed for its votes, and if the cluster is dead then it's already too late. The main thing that worries me is that this node is still showing as Pending which suggests to me that it might come to life at some point in the future. If we knew it would certainly never start then life would be easier. Is that possible?
Unfortunately, "will not run in future" isn't quite enough. Nodes that are not running cannot join a cluster, but they could remain in a cluster for a short while after their deaths. I think that after stopping the node from running we need to ensure it is certainly out of the cluster. I don't think we provide an API to do this today.
I wonder if we should strengthen the voting config exclusions API to accept an unknown node name.
Ok the change to Elasticsearch is now merged to master and 7.x: We have replaced POST /_cluster/voting_config_exclusions/... with POST /_cluster/voting_config_exclusions?node_names=.... The existing API will be supported throughout the rest of 7.x but will result in deprecation warnings when used in ≥7.8.0.
It will shortly be removed in master but I will hold off on doing that for at least a week from now to give you some time to adapt to the new API without breaking your master builds.
Thanks for the heads up @DaveCTurner!
I suggest we keep this issue open for pre-8.0 clusters (we may decide to do nothing about it though).
And create a new one to track the necessary changes for 8.0.0: https://github.com/elastic/cloud-on-k8s/issues/2951.
I just realized that thanks to https://github.com/elastic/elasticsearch/pull/50836 we could already fix this for Elasticsearch 7.8+ by changing our call from /_cluster/voting_config_exclusions/node1,node2 to /_cluster/voting_config_exclusions?node_names=node1,node2, which should properly ignore non-existing nodes. @DaveCTurner pointed this out already in his comment above, not sure how we missed it 😞.
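A sketch of the version switch this implies, assuming the 7.8.0 cutoff mentioned in this thread. The helper name is hypothetical and it only builds the request path; it only makes sense for 7.x and later, and it does not send anything.

```shell
# Pick the voting-config-exclusions request path for a given Elasticsearch
# version ("major.minor.patch") and a comma-separated list of node names.
# voting_exclusions_path is a hypothetical helper name.
voting_exclusions_path() {
  version=$1
  names=$2
  major=${version%%.*}
  rest=${version#*.}
  minor=${rest%%.*}
  if [ "$major" -gt 7 ] || { [ "$major" -eq 7 ] && [ "$minor" -ge 8 ]; }; then
    # 7.8+: names go in the node_names query parameter.
    printf '/_cluster/voting_config_exclusions?node_names=%s\n' "$names"
  else
    # Older 7.x: names go in the path, and unknown names fail with a 400.
    printf '/_cluster/voting_config_exclusions/%s\n' "$names"
  fi
}

# Example (assumes ES_URL and credentials are set up for your cluster):
# curl -s -X POST "$ES_URL$(voting_exclusions_path 7.8.0 node1,node2)"
```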
Raising priority on this issue.
We have https://github.com/elastic/cloud-on-k8s/issues/2951 for the more focused fix of using the new query parameter.
To work around this situation when running Elasticsearch < 7.8, you can edit the StatefulSet and manually scale down the number of replicas:
> kubectl get sts -l elasticsearch.k8s.elastic.co/cluster-name=<cluster-name>
NAME READY AGE
<cluster-name>-es-<nodeset> m/n 44h
> kubectl scale --replicas=m sts/<cluster-name>-es-<nodeset>
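The same two steps could be scripted roughly as follows. This is only a sketch: the function name is made up, it reads `.status.readyReplicas` instead of parsing the READY column, and it assumes the Pods that are not ready are the ones with the highest ordinals, which you should verify before running it.

```shell
# Scale a StatefulSet down to its current number of ready replicas.
# scale_to_ready is a hypothetical helper name.
scale_to_ready() {
  sts=$1
  ready=$(kubectl get sts "$sts" -o jsonpath='{.status.readyReplicas}')
  # readyReplicas is omitted from the status when it is zero; refuse to
  # continue rather than scale the StatefulSet to an empty value.
  if [ -z "$ready" ]; then
    echo "no ready replicas reported for $sts" >&2
    return 1
  fi
  kubectl scale --replicas="$ready" "sts/$sts"
}

# Example:
# scale_to_ready <cluster-name>-es-<nodeset>
```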
So are we going to add this workaround to our troubleshooting docs for <7.8 and close this issue?
I opened https://github.com/elastic/elasticsearch/issues/47990