Cloud-on-k8s: cluster.initial_master_nodes may not refer to existing pods.

Created on 5 Jul 2019 · 12 Comments · Source: elastic/cloud-on-k8s

We inject the cluster.initial_master_nodes setting when creating the first Zen2 master nodes in a cluster. The value is built from pod names that are still only temporary at that point.

The pod name may be changed just prior to actual pod creation if volumeClaimTemplates are used, and a pre-existing PVC is re-used. At this point we copy the pod name from the PVC, which invalidates the cluster.initial_master_nodes setting we injected earlier.
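A minimal sketch of how the injected setting can go stale, assuming the flow described above. This is not the operator's code: the function names, pod names and the PVC-to-pod-name mapping are all hypothetical and only model the ordering problem (setting computed first, pod renamed afterwards).

```python
def build_initial_master_nodes(planned_pod_names):
    """Value injected into the config before the pods are actually created."""
    return list(planned_pod_names)


def final_pod_name(planned_name, pvc_pod_name=None):
    """If a pre-existing PVC is re-used, the pod takes the name stored on the PVC."""
    return pvc_pod_name or planned_name


# Hypothetical temporary names chosen when the config is generated.
planned = ["es-aaa", "es-bbb", "es-ccc"]
injected = build_initial_master_nodes(planned)

# Assumption: two pre-existing PVCs are re-used and carry older pod names.
reused_pvc_names = {"es-aaa": "es-old1", "es-bbb": "es-old2"}
actual = [final_pod_name(name, reused_pvc_names.get(name)) for name in planned]

# Entries in the injected setting that no longer match any real pod.
stale = [name for name in injected if name not in actual]
```

In this toy run two of the three injected names no longer refer to a pod that exists, which is the situation the issue describes.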

This seems to be the root cause for https://github.com/elastic/cloud-on-k8s/issues/1111.

>bug v0.9.0

nkvoll

All 12 comments

Options (non-exhaustive) to fix:

Inject initial_master_nodes after pod creation.
Only inject initial_master_nodes if no PVCs are found (i.e. a cluster that has been bootstrapped once should not be bootstrapped again automatically).
Do not copy pod names from the PVCs.
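As an illustration of the second option, here is a small sketch of the decision it implies. It is not taken from the operator; `should_inject_initial_master_nodes` and `master_settings` are hypothetical helpers, and the PVC list is assumed to be whatever the operator finds for the cluster.

```python
def should_inject_initial_master_nodes(existing_pvc_names):
    """Bootstrap only when no PVCs exist, i.e. the cluster was never bootstrapped."""
    return len(existing_pvc_names) == 0


def master_settings(planned_master_names, existing_pvc_names):
    """Build the extra settings for new master pods (hypothetical helper)."""
    settings = {}
    if should_inject_initial_master_nodes(existing_pvc_names):
        settings["cluster.initial_master_nodes"] = list(planned_master_names)
    return settings


# Fresh cluster: no PVCs yet, so the bootstrap setting is written.
fresh = master_settings(["es-a", "es-b", "es-c"], existing_pvc_names=[])

# Previously bootstrapped cluster: PVCs exist, so the setting is omitted.
restarted = master_settings(["es-a", "es-b", "es-c"], existing_pvc_names=["data-es-a"])
```

Note that this check only covers clusters that use persistent volumes; a cluster without PVCs would always look "fresh" to it.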

nkvoll on 5 Jul 2019

Hi!
We think we are experiencing this issue on 0.8.1, but we are not 100% sure it is exactly the same one. We're interested to know because we'd like to follow the progress on it (notably WRT the 0.9 release) :-)

What we do:

create a cluster with 3 master-only nodes and whatever number of data-only nodes, with volumeClaims on all pods.
wait for it to be up, running and green.
kubectl delete 2 of the 3 masters to temporarily lose the quorum
the 2 deleted pods will be recreated with the same name and PVC but the masters will never be able to talk to each other again and the cluster is now definitively inoperable.
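The quorum arithmetic behind the third step can be sketched as follows. This is plain majority math rather than anything from the operator, and the function names are made up for illustration.

```python
def majority(voting_masters: int) -> int:
    """Smallest number of master-eligible nodes that forms a majority."""
    return voting_masters // 2 + 1


def has_quorum(total_masters: int, available_masters: int) -> bool:
    """True if the available masters can still form a majority."""
    return available_masters >= majority(total_masters)


# 3 masters, 2 deleted: the single survivor cannot form a majority on its own.
survivor_has_quorum = has_quorum(total_masters=3, available_masters=1)
```

With 3 masters the majority is 2, so deleting 2 of them leaves the cluster without a quorum until at least one of the recreated pods rejoins.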

Side question: If we ever encounter this state, is there a hack/trick to get the cluster running again without recreating it from scratch (and losing all the data)?

Thanks for all the work you put into this operator!

PaulGrandperrin on 12 Jul 2019

Since this should be fixed when we move to ssets (as pod names will be static) can we close this out? Or is there a part I'm misunderstanding?

Though keeping it open temporarily may be useful for people currently running into this issue, like our friend PaulGrandperrin above.

anyasabo on 13 Jul 2019

This should definitely be fixed in 0.10 with the move to StatefulSets, but it isn't fixed (yet?) for 0.9, where it is harder to fix.

I'd like to keep this open, hoping we can still fix it for 0.9. It's a pretty huge bug IMO.

sebgl on 15 Jul 2019

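Only inject initial_master_nodes if no PVCs are found (i.e. a cluster that has been bootstrapped once should not be bootstrapped again automatically).

Variation of suggestion no. 2: How about we store the cluster UUID in an annotation on the Elasticsearch resource and don't write initial_master_nodes at all once the cluster has been bootstrapped once?

That should be relatively easy to fix for 0.9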

pebrc on 15 Jul 2019

Is it really specific to the PVC reuse mechanism?
What if there's a failure right after the creation of the first pod:

- 3 pods to create
- first pod is created but not the other ones
- operator recovers later and creates the 2 missing pods

IIUC it will also lead to a stale configuration and the cluster will never be able to elect a master.
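The partial-creation scenario can be modelled with a toy simulation. It assumes, as a simplification, that each reconcile pass writes the names it planned in that pass into the config of every pod it creates, and that the recovery pass generates fresh names for the missing pods. The pod names and the `create_pods` helper are hypothetical.

```python
SETTING = "cluster.initial_master_nodes"


def create_pods(pods, planned_names, fail_after=None):
    """Create the planned pods that do not exist yet; optionally stop early to mimic a failure."""
    created = 0
    for name in planned_names:
        if name in pods:
            continue
        if fail_after is not None and created >= fail_after:
            break
        pods[name] = {SETTING: list(planned_names)}
        created += 1
    return pods


pods = {}

# First pass: 3 pods planned, but only the first one is created before the failure.
create_pods(pods, ["es-001", "es-002", "es-003"], fail_after=1)

# Recovery pass (assumption): the surviving pod is kept, two new names are generated.
create_pods(pods, ["es-001", "es-004", "es-005"])

# For each pod, list the configured master names that do not exist as pods.
dangling = {
    name: [n for n in config[SETTING] if n not in pods]
    for name, config in pods.items()
}
```

Here the first pod still lists two names that were never created, while the two later pods carry a different list, so the pods disagree about the initial master set.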

barkbay on 15 Jul 2019

I agree it is probably not specific to PVC reuse. PVC reuse was just where it was first observed?

pebrc on 15 Jul 2019

Variation of suggestion no. 2: How about we store the cluster UUID in an annotation on the Elasticsearch resource and don't write initial_master_nodes at all once the cluster has been bootstrapped once?

A side effect of relying on the cluster UUID _(which is already stored in the status, not sure we have to use an annotation)_ is that if the cluster is using some emptyDir and all the nodes are deleted it will not be able to recover. I'm not sure we want to cover that case but it would be a regression.
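A rough sketch of the annotation-based variation, with the Elasticsearch resource modelled as a plain dictionary. The annotation key and both helper functions are hypothetical; nothing here reflects the operator's actual API.

```python
# Hypothetical annotation key, used only for this illustration.
BOOTSTRAP_ANNOTATION = "example.invalid/cluster-uuid"


def record_bootstrap(es_resource, cluster_uuid):
    """Remember on the resource that the cluster has formed once."""
    annotations = es_resource.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[BOOTSTRAP_ANNOTATION] = cluster_uuid


def settings_for_new_masters(es_resource, planned_master_names):
    """Write initial_master_nodes only if no cluster UUID has been recorded yet."""
    annotations = es_resource.get("metadata", {}).get("annotations", {})
    if annotations.get(BOOTSTRAP_ANNOTATION):
        return {}
    return {"cluster.initial_master_nodes": list(planned_master_names)}


es = {"metadata": {"name": "demo"}}
before = settings_for_new_masters(es, ["es-a", "es-b", "es-c"])

record_bootstrap(es, "made-up-uuid")
after = settings_for_new_masters(es, ["es-d", "es-e", "es-f"])
```

The second call returns no bootstrap setting even though all pod names are new, which is exactly the emptyDir side effect raised above: once the UUID is recorded, a cluster that lost all its data would not be bootstrapped again automatically.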

barkbay on 16 Jul 2019

if the cluster is using some emptyDir and all the nodes are deleted it will not be able to recover

I think we may want to optimize for that particular case as a feature, not a bug: the cluster should not be able to recover.

sebgl on 16 Jul 2019

(which is already stored in the status, not sure we have to use an annotation)

I guess because we upgrade the importance of the cluster UUID from being purely informational to driving orchestration decisions, and because k8s API conventions dictate that status should be reconstructable. If someone deletes the status we should be able to recreate it at some cost. With the cluster UUID this would mask a potential loss of cluster state, and we would make the wrong decision if we just recreated it with the current UUID.

pebrc on 16 Jul 2019

1272 should improve the situation

Corner cases should be fixed in the next releases (>0.9) with the move to StatefulSets.

barkbay on 18 Jul 2019

Noting for posterity, I think I see this on ECK 0.9.0:

% kubectl get pods -l elasticsearch.k8s.elastic.co/node-master=true
NAME                      READY   STATUS    RESTARTS   AGE
bctest740-es-4bzcwrqsk2   1/1     Running   0          19m
bctest740-es-lfdk5bdwnq   1/1     Running   0          19m
bctest740-es-s9xb4rg289   1/1     Running   0          19m

# Ask a master about its initial master nodes:
% kubectl exec -it bctest740-es-lfdk5bdwnq -- grep -A3 'initial_master_nodes' config/elasticsearch.yml
  initial_master_nodes:
  - bctest740-es-j4bqq94vx5
  - bctest740-es-k8t9bz6plv
  - bctest740-es-kt2b2d4mtk
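The mismatch in the output above can be checked mechanically. The sketch below hard-codes the pod names and the configured list from that output; `stale_entries` is a hypothetical helper, not part of any tool.

```python
def stale_entries(configured, running):
    """Return configured master names that do not match any running pod."""
    return [name for name in configured if name not in running]


# Pod names reported by `kubectl get pods` above.
running_masters = {
    "bctest740-es-4bzcwrqsk2",
    "bctest740-es-lfdk5bdwnq",
    "bctest740-es-s9xb4rg289",
}

# Names found under initial_master_nodes in elasticsearch.yml above.
initial_master_nodes = [
    "bctest740-es-j4bqq94vx5",
    "bctest740-es-k8t9bz6plv",
    "bctest740-es-kt2b2d4mtk",
]

stale = stale_entries(initial_master_nodes, running_masters)
```

For this output every configured entry is stale: none of the three names in initial_master_nodes corresponds to a running master pod.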