We inject the cluster.initial_master_nodes setting when creating the first Zen2 master nodes in a cluster. This is based on what is actually a temporary pod name.
The pod name may be changed just prior to actual pod creation if volumeClaimTemplates are used, and a pre-existing PVC is re-used. At this point we copy the pod name from the PVC, which invalidates the cluster.initial_master_nodes setting we injected earlier.
This seems to be the root cause for https://github.com/elastic/cloud-on-k8s/issues/1111.
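To make the failure mode concrete, here is a minimal sketch of the sequence described above. It is not the operator's actual code: the function names and pod names are hypothetical, and it only models the ordering problem (setting computed first, pod renamed afterwards).

```python
def initial_master_nodes(temp_pod_names):
    """Value injected as cluster.initial_master_nodes, computed from temporary names."""
    return sorted(temp_pod_names)


def final_pod_name(temp_name, reused_pvc_pod_name=None):
    """If a pre-existing PVC is re-used, the pod takes the name stored on the PVC."""
    return reused_pvc_pod_name or temp_name


# Hypothetical temporary names generated for three new master pods.
temp_names = ["es-aaa", "es-bbb", "es-ccc"]
injected = initial_master_nodes(temp_names)

# Hypothetical: two of the pods re-use PVCs that carry older pod names.
reused = {"es-aaa": "es-old1", "es-bbb": "es-old2"}
actual = [final_pod_name(name, reused.get(name)) for name in temp_names]

# Names listed in the injected setting that no running pod will ever have.
stale = sorted(set(injected) - set(actual))
print("injected:", injected)
print("actual:  ", actual)
print("stale:   ", stale)
```

Any entry in `stale` is a node name that Elasticsearch is told to wait for during bootstrapping but that never joins.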
Options (non-exhaustive) to fix:
Hi!
We think we are experiencing this issue on 0.8.1 but we are not 100% sure this is exactly this one. We're interested to know because we'd like to follow the progress on it (notably WRT the 0.9 release) :-)
What we do:
- volumeClaims on all pods.
- kubectl delete 2 of the 3 masters to temporarily lose the quorum
Side question: If we ever encounter this state, is there a hack/trick to get the cluster running again without recreating it from scratch (and losing all the data)?
Thanks for all the work you put into this operator!
Since this should be fixed when we move to ssets (as pod names will be static) can we close this out? Or is there a part I'm misunderstanding?
Though keeping it open temporarily may be useful for people currently running into this issue, like our friend PaulGrandperrin above.
This should definitely be fixed in 0.10 with the move to StatefulSets, but it isn't fixed (yet?) for 0.9, and is harder to fix.
I'd like to keep this open, hoping we can still fix it for 0.9. It's a pretty huge bug IMO.
- Only inject initial_master_nodes if no PVCs are found (i.e. a cluster that has been bootstrapped once should not be bootstrapped again automatically).
Variation of suggestion no. 2: How about we store the cluster UUID in an annotation on the Elasticsearch resource and don't write initial_master_nodes at all once the cluster has been bootstrapped once?
That should be relatively easy to fix for 0.9
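A rough sketch of how that decision could look, assuming the cluster UUID is kept in an annotation on the Elasticsearch resource. The annotation key and both helper names are made up for illustration and are not part of the operator.

```python
# Hypothetical annotation key; the real key (if any) is not specified in this thread.
UUID_ANNOTATION = "example.k8s.elastic.co/cluster-uuid"


def should_write_initial_master_nodes(annotations):
    """Only bootstrap when no cluster UUID has ever been recorded."""
    return not annotations.get(UUID_ANNOTATION)


def record_bootstrap(annotations, cluster_uuid):
    """Return a copy of the annotations with the UUID recorded once, never overwritten."""
    updated = dict(annotations)
    updated.setdefault(UUID_ANNOTATION, cluster_uuid)
    return updated


fresh = {}
bootstrapped = record_bootstrap(fresh, "uuid-1")
print(should_write_initial_master_nodes(fresh))
print(should_write_initial_master_nodes(bootstrapped))
```

Using `setdefault` means a later reconciliation that observes a different UUID does not silently replace the recorded one, so a lost cluster state stays visible instead of being masked.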
Is it really specific to the PVC reuse mechanism?
What if there's a failure right after the creation of the first pod:
IIUC it will also lead to a stale configuration and the cluster will never be able to elect a master.
I agree it is probably not specific to PVC reuse. PVC reuse was just where it was first observed?
Variation of suggestion no. 2: How about we store the cluster UUID in an annotation on the Elasticsearch resource and don't write initial_master_nodes at all once the cluster has been bootstrapped once?
A side effect of relying on the cluster UUID _(which is already stored in the status, not sure we have to use an annotation)_ is that if the cluster is using some emptyDir and all the nodes are deleted it will not be able to recover. I'm not sure we want to cover that case but it would be a regression.
if the cluster is using some emptyDir and all the nodes are deleted it will not be able to recover
I think we may want to optimize for that particular case as a feature, not a bug: the cluster should not be able to recover.
(which is already stored in the status, not sure we have to use an annotation)
I guess because we upgrade the importance of the cluster UUID from being purely informational to driving orchestration decisions, and because k8s API conventions dictate that status should be reconstructable: if someone deletes the status, we should be able to recreate it at some cost. With the cluster UUID this would mask a potential loss of cluster state, and we would make the wrong decision if we just recreated it with the current UUID.
Corner cases should be fixed in next releases (>0.9) with the move to StatefulSets.
Noting for posterity, I think I see this on ECK 0.9.0:
% kubectl get pods -l elasticsearch.k8s.elastic.co/node-master=true
NAME                      READY   STATUS    RESTARTS   AGE
bctest740-es-4bzcwrqsk2   1/1     Running   0          19m
bctest740-es-lfdk5bdwnq   1/1     Running   0          19m
bctest740-es-s9xb4rg289   1/1     Running   0          19m
# Ask a master about its initial master nodes:
% kubectl exec -it bctest740-es-lfdk5bdwnq -- grep -A3 'initial_master_nodes' config/elasticsearch.yml
initial_master_nodes:
- bctest740-es-j4bqq94vx5
- bctest740-es-k8t9bz6plv
- bctest740-es-kt2b2d4mtk
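The mismatch in the output above can be checked mechanically. The following is only an illustrative sketch: the helper name is hypothetical, and the two lists are copied by hand from the kubectl output rather than fetched from a cluster.

```python
def stale_initial_masters(running_masters, configured):
    """Return configured initial master names that match no running master pod."""
    running = set(running_masters)
    return [name for name in configured if name not in running]


# Master pods reported by `kubectl get pods` above.
running = [
    "bctest740-es-4bzcwrqsk2",
    "bctest740-es-lfdk5bdwnq",
    "bctest740-es-s9xb4rg289",
]

# initial_master_nodes found in config/elasticsearch.yml above.
configured = [
    "bctest740-es-j4bqq94vx5",
    "bctest740-es-k8t9bz6plv",
    "bctest740-es-kt2b2d4mtk",
]

missing = stale_initial_masters(running, configured)
print(missing)
```

Here every configured name is stale: none of the three entries in initial_master_nodes corresponds to a running master pod.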