Cloud-on-k8s: Set default node_left timeout to 5m?

Created on 15 Aug 2019 · 8 Comments · Source: elastic/cloud-on-k8s

Proposal

Use case. Why is this important?

Nodes come and go on the cloud. I tell myself it's normal, but like, still makes me anxious ;P

Nodes also come and go during rolling restarts.

I've very commonly observed Elasticsearch on GKE/K8s taking 2+ minutes to start up. The default node_left timeout is 1m:

The allocation of replica shards which become unassigned because a node has left can be delayed with the index.unassigned.node_left.delayed_timeout dynamic setting, which defaults to 1m.

I propose that ECK set, if possible, the default timeout to at least 3m, perhaps 5m, for this value: index.unassigned.node_left.delayed_timeout

This should allow rolling restarts or general pod replacements to recover without causing major replica movements.
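
To make the timing argument concrete, here is a minimal sketch of the comparison, assuming the only thing that matters is whether the node is back before the delay expires. The helper names are made up for illustration and are not part of ECK or Elasticsearch.

```python
import re

# Time units used in this thread, expressed in seconds.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(value: str) -> int:
    """Convert a value such as '1m' or '90s' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


def returns_before_reallocation(restart_time: str, delayed_timeout: str) -> bool:
    """True if the node is back before delayed replica allocation starts."""
    return parse_duration(restart_time) < parse_duration(delayed_timeout)


# A 2m restart misses the 1m default but fits within the proposed 5m.
print(returns_before_reallocation("2m", "1m"))
print(returns_before_reallocation("2m", "5m"))
```

With a 2 minute restart, the 1m default expires first, so replicas start being reallocated. With 5m the node has time to rejoin.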

>docs >enhancement

Most helpful comment

I tend to think we should not apply our own default here. Also ECK 1.0.0-beta1 should help a bit. Some arguments below.

About rolling upgrades:

  • ECK 1.0.0-beta1 has a reworked orchestration algorithm that handles Elasticsearch rolling upgrades as a first-class concept. If you upgrade a 3-node cluster to a 3-node cluster with a different spec (k8s podTemplate or Elasticsearch configuration), we do a clean rolling upgrade as specified in the Elasticsearch documentation, which includes:

    • best-effort sync-flush on all nodes

    • disabling shard allocation for the duration of the upgrade, so shards are not moved around since we know the node will come back online soon

  • Thanks to the last point, node_left.delayed_timeout should not be a problem when doing an Elasticsearch rolling upgrade. However, this only applies to ECK-scheduled Elasticsearch rolling upgrades. Upgrading the underlying Kubernetes cluster is a different thing. @jordansissel is the upgrade you mention an Elasticsearch upgrade or a Kubernetes upgrade?
  • To explain how it's different: when doing a rolling upgrade of the underlying Kubernetes cluster, k8s nodes are usually drained slowly. Our new default PodDisruptionBudget will basically make sure only one Pod at a time can be taken down by Kubernetes, so the cluster health stays yellow or green. The Pod is removed by Kubernetes (not ECK), then automatically re-scheduled (by the StatefulSet controller, not ECK) to another k8s node. From ECK's perspective, we don't really know if the ES node will ever come back, so we cannot disable shard allocation here. Maybe at some point in the future we could optimize for that situation, and find a way to acknowledge in ECK that the user is doing a k8s rolling upgrade so we expect ES nodes to come back online pretty soon. In the meantime, the user can still disable shard allocation using the ES API.
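
For the k8s node-drain case above, disabling allocation by hand could look roughly like the sketch below. It only builds the request (method, path, JSON body) for the cluster settings API and leaves sending it to whatever HTTP client you use. The function name is hypothetical, and using "primaries" rather than "none" is an assumption borrowed from the Elasticsearch rolling upgrade procedure.

```python
import json


def allocation_request(enable):
    """Build a cluster settings update for cluster.routing.allocation.enable.

    Pass "primaries" before draining a k8s node, and None afterwards to
    reset the setting to its default (None is serialized as JSON null).
    """
    body = {"persistent": {"cluster.routing.allocation.enable": enable}}
    return "PUT", "/_cluster/settings", json.dumps(body)


# Before the drain: only primaries may be allocated.
print(*allocation_request("primaries"))
# After the node is back: reset the setting.
print(*allocation_request(None))
```

Remember to send the second request once the Pods are back, otherwise replicas stay unassigned.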

About the time required to start a node:

  • On a quickstart cluster (3 nodes, 1GB heap, 8 cores on GKE) with not much data in it, each node takes about 40 seconds to restart. That includes the time for the Pod to be terminated, then scheduled again with the same data volume, then for the init container to prepare the filesystem, then for the Elasticsearch process to start and join the cluster, and for the cluster health to be green again. Our init container does add to the total time here, and we may be able to optimize a bit, but not much IMO. The actual run time of the init container script is very small (a few seconds).
  • The time required to start the Elasticsearch Pod depends heavily on the CPU requirements set in its podTemplate and on the number of ES shards that need to be initialized. We can't do much about that. 2 minutes sounds plausible here.
  • ECK 0.9 is a big improvement over ECK 0.8 since it no longer relies on our own process manager running in the ES container, doing some extra stuff before Elasticsearch is running. It also removes the need for the cert-initializer init container, which used to wait for ECK to provide an on-demand TLS certificate to the Pod via an HTTP request. Instead, we pre-generate the certificate in ECK and mount it in the created Pod directly. ECK 1.0.0-beta1 basically keeps doing the same thing as 0.9, but since it favors rolling upgrades over grow-and-shrink, we don't have to re-generate and mount certificates for every Pod we create. When the Pod is restarted, it reuses the certificate that is already there. This helped bring restart time down.

About the default value for index.unassigned.node_left.delayed_timeout:

  • In general, our approach is to use Elasticsearch's own defaults for everything, unless we have to tweak them so the cluster can work (e.g. TLS certificates, node discovery, etc.). We stay close to the stack documentation, and avoid doing anything that moves us away from how you would expect Elasticsearch to run by default. We don't want users to have to compare ECK documentation with Elasticsearch documentation to understand how default settings are different.
  • If the way things work changes in future Elasticsearch versions, we'll have to maintain different defaults for different Elasticsearch versions, which is a bit of a mess.
  • It's hard to pick an arbitrary value here. If we think 1 min is too small for most use cases, I think we should change the default value in the Stack, not in ECK. Kubernetes does add the overhead of creating then scheduling Pods, plus bootstrapping Docker containers. I think running Elasticsearch in Kubernetes should be considered a standard way of running Elasticsearch nowadays, and the stack should adapt accordingly if judged necessary.
  • For the user, tweaking that value to what's best for their use case is as simple as specifying:
#config:
#   node.master: false
#   index.unassigned.node_left.delayed_timeout: 5m

in the Elasticsearch yaml spec. It's very explicit, there's no hidden magic here.
Edit: not true, users will have to go through the ES API, as per @jordansissel's comment below.

  • If we still feel like we should do something to optimize Elasticsearch configuration for Kubernetes, I would vote for documenting it in the official doc instead, which already suggests a few tweaks for production-grade clusters.
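
A minimal sketch of the ES API route mentioned in the edit above, assuming you want to apply 5m to every existing index. It builds an update-index-settings request without sending it. The function name and the `_all` target are illustrative choices, not something ECK does for you.

```python
import json

SETTING = "index.unassigned.node_left.delayed_timeout"


def delayed_timeout_request(timeout="5m", indices="_all"):
    """Build an update-index-settings request for existing indices."""
    body = {"settings": {SETTING: timeout}}
    return "PUT", f"/{indices}/_settings", json.dumps(body)


method, path, body = delayed_timeout_request()
print(method, path)
print(body)
```

Because this is a dynamic index setting, it can be changed on live indices. Indices created later are not covered by this call.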

All 8 comments

Note: this setting used to be a valid setting you could put in the node's elasticsearch.yml, but now it's specifically an index setting, so if accepted, this may have to be implemented with a default index template?

Thanks for bringing this up. This makes sense to me, and yes it looks like it'll need to be a template. Agreed on at least 3m, maybe 5m. We can also add documentation in the meantime recommending people set it in a template.
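
As a rough sketch of the template idea, assuming the legacy `_template` API available in the Elasticsearch versions of this thread's era: the snippet below only builds the request. The template name `node-left-timeout` and the catch-all `*` pattern are made-up examples, and a catch-all pattern may also match indices you did not intend to touch.

```python
import json


def node_left_template(timeout="5m", name="node-left-timeout"):
    """Build a legacy index template request; `name` is a made-up example."""
    body = {
        "index_patterns": ["*"],  # match every newly created index
        "order": 0,  # low order, so templates with a higher order override it
        "settings": {"index.unassigned.node_left.delayed_timeout": timeout},
    }
    return "PUT", f"/_template/{name}", json.dumps(body, indent=2)


method, path, body = node_left_template()
print(method, path)
print(body)
```

A template only affects indices created after it is installed, so existing indices would still need their settings updated separately.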

I'm ++ to setting this for the user in the interim. Without this, a user might expect a configuration change to take minutes and end up with it taking hours or days because a node took 2 minutes instead of 1 minute to come back.

I have a feeling this setting might be too low on the Elasticsearch side, and it would be good to discuss if we can change the default there. Adding @eskibars @bleskes for consideration

For some context, with the default setting, we had at least one upgrade which should have been a routine upgrade but took 3-4 days as data was replicated/recovered. For small clusters this isn’t an issue, but this cluster was ~20 data nodes and about 40 TB of storage.

I think we need to better understand why it takes 2 min to start a node. Is that inherent to k8s (which may mean we need a default change), or is something else off?

PS: note that Elasticsearch will cancel recoveries if the node comes back and has perfect shard copies on it (and more improvements are coming there).

> I think we need to better understand why we take 2 min to start a node?

+1, though I haven't spent time on it. We upgrade so often and have so many other tasks that we aren't focusing much on the "why" (besides documenting issues like this) and are focusing more on the "make it work" to support the maintenance on our clusters.

Also +1 to everything @sebgl said. Beyond that, setting a cluster-wide default for node_left is not possible in config: (which becomes elasticsearch.yml) because this is an index setting, not a cluster or node setting, and Elasticsearch rejects it in elasticsearch.yml (or it did fairly recently).

Based on recent feedback, I think we can close this, @jordansissel? Let's reopen if needed.
