Cloud-on-k8s: [1.4.0] After a while, master nodes can't join the cluster anymore

Created on 1 Mar 2021 · 10 Comments · Source: elastic/cloud-on-k8s

Bug Report

What did you do?
Initialized new clusters with 1.4.0.

What did you see instead? Under which circumstances?

After a while (roughly a few days), a couple of my clusters randomly end up with an unknown status. Note that I currently have 10 different Elasticsearch clusters, and this happens randomly on only a few of them.

I tried removing the affected cluster and recreating it. That works, but after a few days another cluster runs into the same issue.

Masters are returning some errors: SERVICE_UNAVAILABLE/1/state not recovered / initialized & MasterNotDiscoveredException.

master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node: have discovered

What I tried to do

I managed to restart the master nodes by applying this configuration to them:

      cluster.initial_master_nodes:
        - elasticsearch-es-master-0
        - elasticsearch-es-master-1
        - elasticsearch-es-master-2

I then restarted them manually. The cluster came back up with a new cluster ID, so this was not a proper solution.

Environment

  • ECK version: 1.4.0 (note that I had been using ECK 1.2.0 without any issue). I never tried 1.3.0. I created new clusters from scratch after upgrading ECK to 1.4.0.

  • Kubernetes information:

Cloud: GKE 1.18.15-gke.1500

  • Resource definition:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
spec:
  auth:
    fileRealm:
      - secretName: realm
  http:
    service:
      metadata:
        creationTimestamp: null
      spec: {}
    tls:
      certificate: {}
      selfSignedCertificate:
        disabled: true
  image: 'oceanpublic/elasticsearch:7.10.2-r1'
  nodeSets:
    - config:
        node.data: false
        node.ingest: false
        node.master: true
      count: 3
      name: master
      podTemplate:
        spec:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - podAffinityTerm:
                    labelSelector:
                      matchLabels:
                        common.k8s.elastic.co/type: elasticsearch
                    topologyKey: kubernetes.io/hostname
                  weight: 100
          containers:
            - command:
                - elasticsearch_exporter
                - >-
                  --es.uri=http://127.0.0.1:9200
                - '--es.all'
                - '--es.indices'
                - '--es.indices_settings'
                - '--es.shards'
                - '--es.snapshots'
                - '--es.timeout=30s'
                - '--web.listen-address=:9108'
                - '--web.telemetry-path=/metrics'
              image: 'justwatch/elasticsearch_exporter:1.1.0'
              livenessProbe:
                httpGet:
                  path: /
                  port: 9108
                  scheme: HTTP
              name: elasticsearch-exporter
              ports:
                - containerPort: 9108
                  name: http
                  protocol: TCP
              readinessProbe:
                httpGet:
                  path: /
                  port: 9108
                  scheme: HTTP
            - env:
                - name: ES_JAVA_OPTS
                  value: '-Xms1g -Xmx1g'
              name: elasticsearch
              resources:
                limits:
                  cpu: '4'
                  memory: 2Gi
                requests:
                  cpu: '1'
                  memory: 2Gi
          initContainers:
            - command:
                - sh
                - '-c'
                - sysctl -w vm.max_map_count=262144
              name: sysctl
              securityContext:
                privileged: true
          nodeSelector:
            preemptible: 'true'
          priorityClassName: critical
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-master
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 1Gi
            storageClassName: standard
    - config:
        node.data: true
        node.ingest: true
        node.master: false
      count: 4
      name: data2
      podTemplate:
        spec:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - podAffinityTerm:
                    labelSelector:
                      matchLabels:
                        common.k8s.elastic.co/type: elasticsearch
                    topologyKey: kubernetes.io/hostname
                  weight: 100
          containers:
            - env:
                - name: ES_JAVA_OPTS
                  value: '-Xms1g -Xmx1g'
              name: elasticsearch
              resources:
                limits:
                  cpu: '4'
                  memory: 2Gi
                requests:
                  cpu: '1'
                  memory: 2Gi
          initContainers:
            - command:
                - sh
                - '-c'
                - sysctl -w vm.max_map_count=262144
              name: sysctl
              securityContext:
                privileged: true
          nodeSelector:
            preemptible: 'true'
          priorityClassName: critical
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 400Gi
            storageClassName: standard
  secureSettings:
    - secretName: gcs-credentials
  version: 7.10.2
>non-issue

All 10 comments

In an earlier version of your post you had an additional init container in your Elasticsearch manifest which ran the elasticsearch-node detach-cluster tool.

The effects of running this tool would be consistent with what you described: master-eligible nodes would detach from the cluster and you would see the master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node: have discovered message in the logs.

You should never run this tool unless in a disaster recovery scenario. Using it in an init container is particularly problematic as init containers will be run every time a Pod restarts.

1) Those were only applied to data nodes.
2) That's what I used to restore the data nodes, because they couldn't join the cluster anymore (the cluster ID had changed).

So the manifest that I shared is the right one that is crashing.

Sorry, I closed the issue by mistake.

One thing I noticed:

 volumeClaimTemplates:
        - metadata:
            name: elasticsearch-master
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 1Gi
            storageClassName: standard

The volume claim template for all nodes, including master nodes, must be called elasticsearch-data. What happens here is that ECK creates a default elasticsearch-data 1GB volume claim for you (since you don't specify it), and does not specify a storage class, so the default one gets used.
Here your elasticsearch-master claim is basically created but unused, since the default elasticsearch-data is used instead.
Could the default storage class on your cluster point to a type of volume where data isn't properly persisted?

(Naming can indeed be confusing, elasticsearch-data is meant as "the data folder of Elasticsearch", and not "the content of data nodes").
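As a rough illustration of the point above, the master nodeSet from the reported manifest could look like this once the claim is renamed. This is only a sketch: the values are copied from the original manifest, only the claim name changes, and the pod template is omitted for brevity.

```yaml
# Sketch only: fragment of spec.nodeSets from the manifest in the report,
# with the volume claim renamed. Pod template omitted for brevity.
nodeSets:
  - name: master
    count: 3
    config:
      node.data: false
      node.ingest: false
      node.master: true
    volumeClaimTemplates:
      - metadata:
          # Must be "elasticsearch-data" for every nodeSet, master nodes included.
          # With any other name, ECK adds its own default elasticsearch-data claim.
          name: elasticsearch-data
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1Gi
          storageClassName: standard
```

Volume claim templates of an existing StatefulSet generally cannot be changed in place, so applying this to a running cluster would likely mean introducing a new nodeSet under a different name rather than editing the existing one.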

Thanks for your answers, @sebgl and @pebrc!

I didn't know about that. I'll change it at some point, but as there is no easy way to do so (short of resetting the whole cluster and restoring a snapshot), it might take a while.

But I don't think it's the issue, because:
1) My default storage class is persistent
2) I'm using almost exclusively preemptible instances, which means their lifespan is at most 24 hours.

  • I would have had a lot of issues before
  • the issue can appear after several days
  • some clusters have been running for a week without any issue

But if you still believe that could be the issue, I'm definitely willing to give it a try quickly.

Seeing master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node: have discovered in the logs of master nodes definitely seems to indicate there was data loss on those master nodes: they cannot infer from their existing data that they were part of a cluster.
Is there a chance your PVCs were recreated at some point? You could look at the master nodes PVC & PV creation timestamp, and check whether that matches the master node data loss event.

kubectl get elasticsearch -n jaeger
kubectl get pvc -n jaeger
kubectl get pv -n jaeger

All return the same creation date, on both clusters that became "unknown".

I can also try to rollback to 1.3.2, see if that changes anything.

> I can also try to rollback to 1.3.2, see if that changes anything.

Going back to a previous version is not supported (unless you recreate all your workloads). You can only go forwards.

Ok, good to know, I wanted to make sure before trying. Thanks!

I'm a bit out of ideas right now.
I guess I'll recreate my clusters (and restore the data) and monitor them more closely to see if I can gather more information next time.

Closing due to inactivity. Feel free to reopen if needed.
