Cloud-on-k8s: When scaling down and changing container resources, data is not migrated

Created on 20 Oct 2020 · 7 Comments · Source: elastic/cloud-on-k8s

What did you do?
I scaled down a nodeSet and changed the containers' resources at the same time.

What did you expect to see?

Data migrated before terminating the node.

What did you see instead? Under which circumstances?

Node was terminated and data was not migrated off the node.

Environment

  • ECK version:

    1.2.1

  • Kubernetes information:
    AKS 1.18.8

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"73ec19bdfc6008cd3ce6de96c663f70a69e2b8fc", GitTreeState:"clean", BuildDate:"2020-09-17T04:17:08Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

  • Resource definition:
    I kept only the data nodes definition:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-logs
  namespace: elk
spec:
  version: 7.9.2
  nodeSets:
  - name: data-2
    count: 3  # before: 4
    config:
      node.master: false
      node.data: true
      node.ingest: false
      node.attr.data: warm
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 600Gi
        storageClassName: fast
    podTemplate:
      spec:
        tolerations:
          - effect: NoSchedule
            key: kubernetes.azure.com/scalesetpriority
            operator: Equal
            value: spot
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    elasticsearch.k8s.elastic.co/cluster-name: elasticsearch-logs
                    elasticsearch.k8s.elastic.co/node-data: "true"
                topologyKey: kubernetes.io/hostname
        initContainers:
        - name: sysctl
          securityContext:
            privileged: true
          command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
        containers:
        - name: elasticsearch
          env:
          - name: ES_JAVA_OPTS
            value: -Xms20g -Xmx20g  # before: -Xms18g -Xmx18g
          resources:
            requests:
              memory: 25Gi  # before: 21G
            limits:
              memory: 25Gi  # before: 21G

  • Logs:
{"log.level":"info","@timestamp":"2020-10-20T16:19:27.936Z","log.logger":"license-controller","message":"Starting reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":40,"namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:27.976Z","log.logger":"license-controller","message":"Ending reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":40,"namespace":"elk","es_name":"elasticsearch-logs","took":0.039793238}
{"log.level":"info","@timestamp":"2020-10-20T16:19:28.048Z","log.logger":"association.kb-es-association-controller","message":"Ending reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":43,"namespace":"elk","kb_name":"elasticsearch-logs","took":0.132928942}
{"log.level":"info","@timestamp":"2020-10-20T16:19:28.372Z","log.logger":"generic-reconciler","message":"Updating resource","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","kind":"StatefulSet","namespace":"elk","name":"elasticsearch-logs-es-data-2"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:28.381Z","log.logger":"statefulset","message":"Pod validation skipped","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs","error":"admission webhook \"aks-webhook-admission-controller.azmk8s.io\" does not support dry run"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:28.397Z","log.logger":"zen2","message":"Ensuring no voting exclusions are set","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:28.526Z","log.logger":"migrate-data","message":"Setting routing allocation excludes","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs","value":"elasticsearch-logs-es-data-2-3"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:29.408Z","log.logger":"driver","message":"Cannot restart some nodes for upgrade at this time","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs","failed_predicates":{"only_restart_healthy_node_if_green_or_yellow":["elasticsearch-logs-es-data-2-3","elasticsearch-logs-es-data-2-2","elasticsearch-logs-es-data-2-1","elasticsearch-logs-es-data-2-0"]}}
{"log.level":"info","@timestamp":"2020-10-20T16:19:29.486Z","log.logger":"elasticsearch-controller","message":"Ending reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":208,"namespace":"elk","es_name":"elasticsearch-logs","took":1.584136303}
{"log.level":"info","@timestamp":"2020-10-20T16:19:29.486Z","log.logger":"elasticsearch-controller","message":"Starting reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":209,"namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:29.520Z","log.logger":"association.kb-es-association-controller","message":"Starting reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":44,"namespace":"elk","kb_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:30.830Z","log.logger":"elasticsearch-controller","message":"Starting reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":210,"namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:31.241Z","log.logger":"zen2","message":"Ensuring no voting exclusions are set","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:31.442Z","log.logger":"migrate-data","message":"Setting routing allocation excludes","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs","value":"elasticsearch-logs-es-data-2-3"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:32.205Z","log.logger":"driver","message":"Disabling shards allocation","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","es_name":"elasticsearch-logs","namespace":"elk"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:32.308Z","log.logger":"driver","message":"Requesting a synced flush","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","es_name":"elasticsearch-logs","namespace":"elk"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:45.738Z","log.logger":"driver","message":"synced flush failed with 409 CONFLICT. Ignoring.","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:45.738Z","log.logger":"driver","message":"Deleting pod for rolling upgrade","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","es_name":"elasticsearch-logs","namespace":"elk","pod_name":"elasticsearch-logs-es-data-2-3","pod_uid":"932ac2e0-2eb4-4d1d-8185-a2abc071fb4a"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:45.755Z","log.logger":"elasticsearch-controller","message":"Ending reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":210,"namespace":"elk","es_name":"elasticsearch-logs","took":14.92511819}
{"log.level":"info","@timestamp":"2020-10-20T16:20:18.742Z","log.logger":"elasticsearch-controller","message":"Starting reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":220,"namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:19.060Z","log.logger":"zen2","message":"Ensuring no voting exclusions are set","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:19.147Z","log.logger":"migrate-data","message":"Setting routing allocation excludes","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs","value":"elasticsearch-logs-es-data-2-3"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:19.440Z","log.logger":"driver","message":"Data migration completed successfully, starting node deletion","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","statefulset_name":"elasticsearch-logs-es-data-2","node":"elasticsearch-logs-es-data-2-3"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:19.440Z","log.logger":"driver","message":"Scaling replicas down","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","statefulset_name":"elasticsearch-logs-es-data-2","from":4,"to":3}
{"log.level":"info","@timestamp":"2020-10-20T16:20:19.457Z","log.logger":"elasticsearch-controller","message":"Ending reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":220,"namespace":"elk","es_name":"elasticsearch-logs","took":0.714500954}
{"log.level":"info","@timestamp":"2020-10-20T16:20:19.457Z","log.logger":"elasticsearch-controller","message":"Starting reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":221,"namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:19.460Z","log.logger":"transport","message":"Skipping pod because it has no IP yet","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","pod_name":"elasticsearch-logs-es-data-2-3"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:27.916Z","log.logger":"elasticsearch-controller","message":"Starting reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":225,"namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:27.918Z","log.logger":"transport","message":"Skipping pod because it has no IP yet","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","pod_name":"elasticsearch-logs-es-data-2-3"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:28.226Z","log.logger":"driver","message":"Deleting PVC","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","pvc_name":"elasticsearch-data-elasticsearch-logs-es-data-2-3"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:28.239Z","log.logger":"generic-reconciler","message":"Updating resource","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","kind":"StatefulSet","namespace":"elk","name":"elasticsearch-logs-es-data-2"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:28.243Z","log.logger":"statefulset","message":"Pod validation skipped","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs","error":"admission webhook \"aks-webhook-admission-controller.azmk8s.io\" does not support dry run"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:28.258Z","log.logger":"zen2","message":"Ensuring no voting exclusions are set","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:28.394Z","log.logger":"migrate-data","message":"Setting routing allocation excludes","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs","value":"none_excluded"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:28.576Z","log.logger":"driver","message":"Cannot restart some nodes for upgrade at this time","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs","failed_predicates":{"only_restart_healthy_node_if_green_or_yellow":["elasticsearch-logs-es-data-2-2","elasticsearch-logs-es-data-2-1","elasticsearch-logs-es-data-2-0"]}}
{"log.level":"info","@timestamp":"2020-10-20T16:20:28.620Z","log.logger":"driver","message":"Enabling shards allocation","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs"}
>bug
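
One way to read the operator log above is as a timeline for the pod being removed. The sketch below is only an illustration: it takes three entries copied from the log (trimmed to the fields it uses) and measures the time between the rolling-upgrade pod deletion and the "migration completed" message. The helper names `timeline` and `seconds_between` are made up for this example and are not part of ECK.

```python
import json
from datetime import datetime

# Three entries copied from the operator log in this issue,
# trimmed to the fields this sketch needs.
LOG_LINES = [
    '{"@timestamp":"2020-10-20T16:19:31.442Z","log.logger":"migrate-data",'
    '"message":"Setting routing allocation excludes",'
    '"value":"elasticsearch-logs-es-data-2-3"}',
    '{"@timestamp":"2020-10-20T16:19:45.738Z","log.logger":"driver",'
    '"message":"Deleting pod for rolling upgrade",'
    '"pod_name":"elasticsearch-logs-es-data-2-3"}',
    '{"@timestamp":"2020-10-20T16:20:19.440Z","log.logger":"driver",'
    '"message":"Data migration completed successfully, starting node deletion",'
    '"node":"elasticsearch-logs-es-data-2-3"}',
]


def parse_ts(value):
    # fromisoformat() rejects a trailing "Z" before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def timeline(lines, pod):
    """Return (timestamp, logger, message) tuples that mention the pod."""
    events = []
    for line in lines:
        entry = json.loads(line)
        if pod in (entry.get("pod_name"), entry.get("node"), entry.get("value")):
            events.append(
                (parse_ts(entry["@timestamp"]), entry["log.logger"], entry["message"])
            )
    return sorted(events)


def seconds_between(events, first_message, second_message):
    """Seconds elapsed from first_message to second_message."""
    stamps = {message: ts for ts, _, message in events}
    return (stamps[second_message] - stamps[first_message]).total_seconds()


EVENTS = timeline(LOG_LINES, "elasticsearch-logs-es-data-2-3")
GAP = seconds_between(
    EVENTS,
    "Deleting pod for rolling upgrade",
    "Data migration completed successfully, starting node deletion",
)

for ts, logger, message in EVENTS:
    print(ts.isoformat(), logger, message)
print("seconds between pod deletion and 'migration completed':", GAP)
```

The ordering is the point of interest: the pod is deleted for the rolling upgrade while it is still listed in the allocation excludes, and the migration is reported complete about half a minute later. Whether that ordering explains the data loss is a hypothesis to verify, not something the log alone proves.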

All 7 comments

According to the logs, the data on elasticsearch-logs-es-data-2-3 was migrated:

{"log.level":"info","@timestamp":"2020-10-20T16:20:19.147Z","log.logger":"migrate-data","message":"Setting routing allocation excludes","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs","value":"elasticsearch-logs-es-data-2-3"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:19.440Z","log.logger":"driver","message":"Data migration completed successfully, starting node deletion","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","statefulset_name":"elasticsearch-logs-es-data-2","node":"elasticsearch-logs-es-data-2-3"}

Could you tell us how you noticed that the data had not been migrated?

Could you tell us how you noticed that the data had not been migrated?

  • The pod was terminated very soon (< 2 min) after that, although it held ~200 GB of data.
  • Some indices status went red

I can try to reproduce if you want.
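
A more direct check than timing is to look at where the shards are before a node goes away. The sketch below assumes the rows come from `GET _cat/shards?format=json&h=index,shard,prirep,state,node`; the sample rows and index names are invented for illustration, and `shards_on_node` is a hypothetical helper, not an ECK or Elasticsearch function.

```python
def shards_on_node(cat_shards, node_name):
    """List shard copies that still live on node_name.

    cat_shards is the parsed JSON from the _cat/shards API. For a
    relocating shard the "node" column is assumed to read
    "source -> target ...", so only the part before " -> " is compared.
    """
    remaining = []
    for row in cat_shards:
        node = row.get("node") or ""  # unassigned shards have no node
        source = node.split(" -> ")[0]
        if source == node_name:
            remaining.append((row["index"], row["shard"], row["prirep"], row["state"]))
    return remaining


# Invented sample rows, shaped like _cat/shards JSON output.
SAMPLE = [
    {"index": "logs-a", "shard": "0", "prirep": "p", "state": "STARTED",
     "node": "elasticsearch-logs-es-data-2-3"},
    {"index": "logs-b", "shard": "0", "prirep": "p", "state": "STARTED",
     "node": "elasticsearch-logs-es-data-2-1"},
    {"index": "logs-c", "shard": "1", "prirep": "r", "state": "UNASSIGNED",
     "node": None},
]

LEFT = shards_on_node(SAMPLE, "elasticsearch-logs-es-data-2-3")
print(LEFT)
```

An empty result only means no shard is currently reported on that node. If the node has already left the cluster, the result is also empty, so this check is meaningful only while the node is still a cluster member.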

Thanks, I'll investigate and update the issue with my findings.

Some indices status went red

How many replicas are configured for the affected indices?
Also, is the cluster still red, or did it eventually recover?

I have been able to reproduce. I have opened a new issue (https://github.com/elastic/cloud-on-k8s/issues/3867) as I think that the scope is a bit "broader" than just being able to upgrade and downscale at the same time.

I'm keeping this one open to see whether we can mitigate it while we work on the other one.

How many replicas are configured for the affected indices?

The affected indices had 0 replicas. Indices with replicas went yellow but not red.
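
The red versus yellow split reported here follows from how index health is derived: an index is red when some shard has no surviving copy, and yellow when every shard still has a copy but at least one copy is missing. A simplified sketch of that rule, with invented shard layouts and a hypothetical helper name:

```python
from collections import defaultdict


def health_after_loss(copies, lost_node):
    """Estimate index health right after lost_node disappears.

    copies is a list of (shard_number, role, node) tuples, where role is
    "p" or "r". This is a simplification that ignores later reallocation.
    """
    survivors = defaultdict(int)
    lost = 0
    for shard, _role, node in copies:
        survivors[shard] += 0  # register the shard even if no copy survives
        if node == lost_node:
            lost += 1
        else:
            survivors[shard] += 1
    if any(count == 0 for count in survivors.values()):
        return "red"  # at least one shard has no copy left
    return "yellow" if lost else "green"


LOST = "elasticsearch-logs-es-data-2-3"

# Invented layouts: one index without replicas, one with a replica.
NO_REPLICA = [(0, "p", LOST), (1, "p", "elasticsearch-logs-es-data-2-1")]
ONE_REPLICA = [(0, "p", LOST), (0, "r", "elasticsearch-logs-es-data-2-0")]

print(health_after_loss(NO_REPLICA, LOST))
print(health_after_loss(ONE_REPLICA, LOST))
```

With zero replicas, the only copy of a shard lives on a single node, so losing that node before relocation finishes leaves nothing to promote.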

Also, is the cluster still red, or did it eventually recover?

No, not until we scaled the number of nodes back up. We removed the UID from the claimRef on the PV that held the data (its status went from Released to Available), so that the new PVC created when scaling up could bind to this existing PV.
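
The recovery step described above can be sketched as a JSON Patch that removes `spec.claimRef.uid` from the PersistentVolume. The comment does not say which command was actually used, so the `kubectl patch` invocation below is an assumption, and `pv-example` is a placeholder name.

```python
import json


def release_claim_patch():
    """JSON Patch that drops the stale PVC UID from a Released PV.

    The claimRef name and namespace stay in place, so a new PVC with
    the same name can bind to the volume again.
    """
    return [{"op": "remove", "path": "/spec/claimRef/uid"}]


def kubectl_patch_args(pv_name):
    """Build the argument list for a kubectl patch call (not executed here)."""
    return [
        "kubectl", "patch", "pv", pv_name,
        "--type", "json",
        "-p", json.dumps(release_claim_patch()),
    ]


# "pv-example" is a placeholder; use the name of the Released PV.
ARGS = kubectl_patch_args("pv-example")
print(" ".join(ARGS))
```

The sketch only prints the command. Check the PV's reclaim policy and contents before running anything like it against a real cluster, because a volume with a `Delete` reclaim policy may already be gone once its PVC is removed.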

Thanks for digging into this issue! (or related issues)

I'm closing this one, as a fix has been merged in 1.3 (which should be available in a couple of days).

@mtparet Thank you for having raised this issue.
