Use case. Why is this important?
Nodes come and go on the cloud. I tell myself it's normal, but like, still makes me anxious ;P
Nodes also come and go during rolling restarts.
I've observed Elasticsearch on GKE/K8s very commonly taking 2+ minutes to start up. The default node_left timeout is 1m.
The allocation of replica shards which become unassigned because a node has left can be delayed with the index.unassigned.node_left.delayed_timeout dynamic setting, which defaults to 1m.
I propose that ECK set, if possible, the default timeout to at least 3m, perhaps 5m, for this value: index.unassigned.node_left.delayed_timeout
This should allow rolling restarts or general pod replacements to recover without causing major replica movements.
Note: this setting used to be a valid setting you could put in the node's elasticsearch.yml, but now it's specifically an index setting, so if accepted, this may have to be implemented with a default index template?
Thanks for bringing this up. This makes sense to me, and yes it looks like it'll need to be a template. Agreed on at least 3m, maybe 5m. We can also add documentation in the meantime recommending people set it in a template.
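As a rough illustration of the template approach discussed above, here is a minimal Python sketch that only builds the request for a catch-all index template carrying the setting. It assumes the legacy `_template` endpoint of Elasticsearch 6.x/7.x; the template name `delayed-allocation-default` is hypothetical, and this is not necessarily how ECK would implement it. Templates apply only to indices created after the template exists, so existing indices would still need a settings update.

```python
import json

# Hypothetical template name, chosen for this example only.
TEMPLATE_NAME = "delayed-allocation-default"


def build_template_request(timeout="5m"):
    """Return (method, path, body) for a legacy index template request.

    Assumption: the cluster still accepts the legacy /_template endpoint.
    """
    body = {
        # Match every index name so the setting acts as a default.
        "index_patterns": ["*"],
        # Low order so more specific templates can override the value.
        "order": 0,
        "settings": {
            "index.unassigned.node_left.delayed_timeout": timeout,
        },
    }
    return "PUT", f"/_template/{TEMPLATE_NAME}", json.dumps(body)


if __name__ == "__main__":
    method, path, body = build_template_request("5m")
    print(method, path)
    print(body)
```

The sketch only constructs the request; sending it (with whatever authentication and TLS settings the cluster requires) is left to the reader's HTTP client of choice.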
I'm ++ to setting this for the user in the interim. Without this a user might expect minutes for a configuration change and end up taking hours / days because a node took 2 minutes instead of 1 minute.
I have a feeling this setting might be too low on the Elasticsearch side, and would be good to discuss if we can change the default there. Adding @eskibars @bleskes for consideration
For some context, with the default setting, we had at least one upgrade which should have been a routine upgrade but took 3-4 days as data was replicated/recovered. For small clusters this isn’t an issue, but this cluster was ~20 data nodes and about 40 TB of storage.
(Sent by email in reply to Anurag Gupta's comment of Mon, Oct 7, 2019, 7:31 PM.)
I think we need to better understand why it takes 2 minutes to start a node. Is that inherent to k8s (which may mean we need a default change), or is something else off?
PS: note that Elasticsearch will cancel recoveries if the node comes back and has perfect shard copies on it (and more improvements are coming there).
I tend to think we should not apply our own default here. Also ECK 1.0.0-beta1 should help a bit. Some arguments below.
About rolling upgrades:
node_left.delayed_timeout should not be a problem when doing an Elasticsearch rolling upgrade. However, this is only true for ECK-scheduled Elasticsearch rolling upgrades. Upgrading the underlying Kubernetes cluster is a different thing. @jordansissel is the upgrade you mention an Elasticsearch upgrade or a Kubernetes upgrade?
About the time required to start a node:
ECK 0.9 is a big improvement over ECK 0.8, since it no longer relies on our own process manager running in the ES container and doing some extra work before Elasticsearch starts. It also removes the need for the cert-initializer init container, which used to wait for ECK to provide an on-demand TLS certificate to the Pod via an HTTP request. Instead, we pre-generate the certificate in ECK and mount it in the created Pod directly. ECK 1.0.0-beta1 basically keeps doing the same thing as 0.9, but since it favors rolling upgrades over grow-and-shrink, we don't have to re-generate and mount certificates for every Pod we create. When a Pod is restarted, it reuses the certificate that is already there. This helped bring restart time down.
About the default value for index.unassigned.node_left.delayed_timeout:
For the user, tweaking that value to what's best for their use case is as simple as specifying:
#config:
# node.master: false
# index.unassigned.node_left.delayed_timeout: 5m
in the Elasticsearch yaml spec. It's very explicit, there's no hidden magic here.
Edit: not true, users will have to go through ES API as per @jordansissel comment below.
I think we need to better understand why we take 2 min to start a node?
+1 though I haven't spent time on it. We upgrade so often and have so many other tasks we aren't focusing much on the "why" (besides documenting issues like this) and focusing more on the "make it work" to support the maintenance on our clusters.
Also +1 everything @sebgl said. Beyond that, setting a cluster-wide default for node_left is not possible in config: (which becomes elasticsearch.yml) because this setting is an index setting, not a cluster or node setting, and Elasticsearch rejects this setting in elasticsearch.yml (or it did fairly recently)
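Since the setting has to go through the Elasticsearch API, here is a minimal Python sketch of the index settings update for existing indices. It only builds the `PUT /_all/_settings` request; the base URL is a placeholder, and authentication and TLS trust are assumed to be handled separately.

```python
import json
import urllib.request

SETTING = "index.unassigned.node_left.delayed_timeout"


def build_settings_request(base_url, timeout="5m", indices="_all"):
    """Build (but do not send) a PUT request that updates the delayed
    allocation timeout on the given indices."""
    body = json.dumps({"settings": {SETTING: timeout}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{indices}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Placeholder URL; replace with the cluster's HTTP endpoint.
    req = build_settings_request("https://localhost:9200", timeout="5m")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To apply it: urllib.request.urlopen(req), once credentials and
    # certificate verification are configured for the cluster.
```

Because this is a dynamic index setting, the change applies to the targeted indices without a restart, but indices created later will not pick it up unless a template also sets it.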
Based on the recent feedback, I think we can close this, @jordansissel? Let's reopen if needed.