What did you do?
I scaled down a nodeSet and changed the resources of its containers at the same time.
What did you expect to see?
Data migrated before terminating the node.
What did you see instead? Under which circumstances?
Node was terminated and data was not migrated off the node.
Environment
ECK version:
1.2.1
Kubernetes information:
AKS 1.18.8
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"73ec19bdfc6008cd3ce6de96c663f70a69e2b8fc", GitTreeState:"clean", BuildDate:"2020-09-17T04:17:08Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-logs
  namespace: elk
spec:
  version: 7.9.2
  nodeSets:
  - name: data-2
    count: 3  # BEFORE: 4
    config:
      node.master: false
      node.data: true
      node.ingest: false
      node.attr.data: warm
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 600Gi
        storageClassName: fast
    podTemplate:
      spec:
        tolerations:
        - effect: NoSchedule
          key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  elasticsearch.k8s.elastic.co/cluster-name: elasticsearch-logs
                  elasticsearch.k8s.elastic.co/node-data: "true"
              topologyKey: kubernetes.io/hostname
        initContainers:
        - name: sysctl
          securityContext:
            privileged: true
          command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
        containers:
        - name: elasticsearch
          env:
          - name: ES_JAVA_OPTS
            value: -Xms20g -Xmx20g  # BEFORE: -Xms18g -Xmx18g
          resources:
            requests:
              memory: 25Gi  # BEFORE: 21G
            limits:
              memory: 25Gi  # BEFORE: 21G
{"log.level":"info","@timestamp":"2020-10-20T16:19:27.936Z","log.logger":"license-controller","message":"Starting reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":40,"namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:27.976Z","log.logger":"license-controller","message":"Ending reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":40,"namespace":"elk","es_name":"elasticsearch-logs","took":0.039793238}
{"log.level":"info","@timestamp":"2020-10-20T16:19:28.048Z","log.logger":"association.kb-es-association-controller","message":"Ending reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":43,"namespace":"elk","kb_name":"elasticsearch-logs","took":0.132928942}
{"log.level":"info","@timestamp":"2020-10-20T16:19:28.372Z","log.logger":"generic-reconciler","message":"Updating resource","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","kind":"StatefulSet","namespace":"elk","name":"elasticsearch-logs-es-data-2"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:28.381Z","log.logger":"statefulset","message":"Pod validation skipped","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs","error":"admission webhook \"aks-webhook-admission-controller.azmk8s.io\" does not support dry run"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:28.397Z","log.logger":"zen2","message":"Ensuring no voting exclusions are set","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:28.526Z","log.logger":"migrate-data","message":"Setting routing allocation excludes","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs","value":"elasticsearch-logs-es-data-2-3"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:29.408Z","log.logger":"driver","message":"Cannot restart some nodes for upgrade at this time","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs","failed_predicates":{"only_restart_healthy_node_if_green_or_yellow":["elasticsearch-logs-es-data-2-3","elasticsearch-logs-es-data-2-2","elasticsearch-logs-es-data-2-1","elasticsearch-logs-es-data-2-0"]}}
{"log.level":"info","@timestamp":"2020-10-20T16:19:29.486Z","log.logger":"elasticsearch-controller","message":"Ending reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":208,"namespace":"elk","es_name":"elasticsearch-logs","took":1.584136303}
{"log.level":"info","@timestamp":"2020-10-20T16:19:29.486Z","log.logger":"elasticsearch-controller","message":"Starting reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":209,"namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:29.520Z","log.logger":"association.kb-es-association-controller","message":"Starting reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":44,"namespace":"elk","kb_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:30.830Z","log.logger":"elasticsearch-controller","message":"Starting reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":210,"namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:31.241Z","log.logger":"zen2","message":"Ensuring no voting exclusions are set","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:31.442Z","log.logger":"migrate-data","message":"Setting routing allocation excludes","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs","value":"elasticsearch-logs-es-data-2-3"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:32.205Z","log.logger":"driver","message":"Disabling shards allocation","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","es_name":"elasticsearch-logs","namespace":"elk"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:32.308Z","log.logger":"driver","message":"Requesting a synced flush","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","es_name":"elasticsearch-logs","namespace":"elk"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:45.738Z","log.logger":"driver","message":"synced flush failed with 409 CONFLICT. Ignoring.","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:45.738Z","log.logger":"driver","message":"Deleting pod for rolling upgrade","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","es_name":"elasticsearch-logs","namespace":"elk","pod_name":"elasticsearch-logs-es-data-2-3","pod_uid":"932ac2e0-2eb4-4d1d-8185-a2abc071fb4a"}
{"log.level":"info","@timestamp":"2020-10-20T16:19:45.755Z","log.logger":"elasticsearch-controller","message":"Ending reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":210,"namespace":"elk","es_name":"elasticsearch-logs","took":14.92511819}
{"log.level":"info","@timestamp":"2020-10-20T16:20:18.742Z","log.logger":"elasticsearch-controller","message":"Starting reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":220,"namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:19.060Z","log.logger":"zen2","message":"Ensuring no voting exclusions are set","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:19.147Z","log.logger":"migrate-data","message":"Setting routing allocation excludes","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs","value":"elasticsearch-logs-es-data-2-3"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:19.440Z","log.logger":"driver","message":"Data migration completed successfully, starting node deletion","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","statefulset_name":"elasticsearch-logs-es-data-2","node":"elasticsearch-logs-es-data-2-3"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:19.440Z","log.logger":"driver","message":"Scaling replicas down","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","statefulset_name":"elasticsearch-logs-es-data-2","from":4,"to":3}
{"log.level":"info","@timestamp":"2020-10-20T16:20:19.457Z","log.logger":"elasticsearch-controller","message":"Ending reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":220,"namespace":"elk","es_name":"elasticsearch-logs","took":0.714500954}
{"log.level":"info","@timestamp":"2020-10-20T16:20:19.457Z","log.logger":"elasticsearch-controller","message":"Starting reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":221,"namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:19.460Z","log.logger":"transport","message":"Skipping pod because it has no IP yet","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","pod_name":"elasticsearch-logs-es-data-2-3"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:27.916Z","log.logger":"elasticsearch-controller","message":"Starting reconciliation run","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","iteration":225,"namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:27.918Z","log.logger":"transport","message":"Skipping pod because it has no IP yet","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","pod_name":"elasticsearch-logs-es-data-2-3"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:28.226Z","log.logger":"driver","message":"Deleting PVC","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","pvc_name":"elasticsearch-data-elasticsearch-logs-es-data-2-3"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:28.239Z","log.logger":"generic-reconciler","message":"Updating resource","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","kind":"StatefulSet","namespace":"elk","name":"elasticsearch-logs-es-data-2"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:28.243Z","log.logger":"statefulset","message":"Pod validation skipped","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs","error":"admission webhook \"aks-webhook-admission-controller.azmk8s.io\" does not support dry run"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:28.258Z","log.logger":"zen2","message":"Ensuring no voting exclusions are set","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:28.394Z","log.logger":"migrate-data","message":"Setting routing allocation excludes","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs","value":"none_excluded"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:28.576Z","log.logger":"driver","message":"Cannot restart some nodes for upgrade at this time","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs","failed_predicates":{"only_restart_healthy_node_if_green_or_yellow":["elasticsearch-logs-es-data-2-2","elasticsearch-logs-es-data-2-1","elasticsearch-logs-es-data-2-0"]}}
{"log.level":"info","@timestamp":"2020-10-20T16:20:28.620Z","log.logger":"driver","message":"Enabling shards allocation","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs"}
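To make the order of events in these logs easier to follow, here is a minimal sketch that pulls out the entries mentioning the pod being removed and sorts them by timestamp. The four sample lines are abbreviated copies of entries from the logs above (unrelated fields dropped); the `timeline` helper is hypothetical and not part of ECK.

```python
import json

# Abbreviated copies of four operator log lines from above; fields not
# needed for the timeline (service.version, ecs.version, ...) are dropped.
LOG_LINES = [
    '{"@timestamp":"2020-10-20T16:19:28.526Z","log.logger":"migrate-data","message":"Setting routing allocation excludes","value":"elasticsearch-logs-es-data-2-3"}',
    '{"@timestamp":"2020-10-20T16:19:45.738Z","log.logger":"driver","message":"Deleting pod for rolling upgrade","pod_name":"elasticsearch-logs-es-data-2-3"}',
    '{"@timestamp":"2020-10-20T16:20:19.440Z","log.logger":"driver","message":"Data migration completed successfully, starting node deletion","node":"elasticsearch-logs-es-data-2-3"}',
    '{"@timestamp":"2020-10-20T16:20:28.226Z","log.logger":"driver","message":"Deleting PVC","pvc_name":"elasticsearch-data-elasticsearch-logs-es-data-2-3"}',
]


def timeline(lines, pod):
    """Return (timestamp, message) pairs for entries that mention `pod`, oldest first."""
    events = []
    for line in lines:
        entry = json.loads(line)
        # A pod counts as "mentioned" if its name appears in any string field.
        if any(isinstance(value, str) and pod in value for value in entry.values()):
            events.append((entry["@timestamp"], entry["message"]))
    # ISO-8601 UTC timestamps in one format sort correctly as plain strings.
    return sorted(events)


for timestamp, message in timeline(LOG_LINES, "elasticsearch-logs-es-data-2-3"):
    print(timestamp, message)
```

Read this way, the pod is deleted for the rolling upgrade at 16:19:45, roughly 34 seconds before the operator reports that data migration completed at 16:20:19. The logs alone do not show whether any shards actually moved in between.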
According to the logs, the data on elasticsearch-logs-es-data-2-3 was migrated:
{"log.level":"info","@timestamp":"2020-10-20T16:20:19.147Z","log.logger":"migrate-data","message":"Setting routing allocation excludes","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","es_name":"elasticsearch-logs","value":"elasticsearch-logs-es-data-2-3"}
{"log.level":"info","@timestamp":"2020-10-20T16:20:19.440Z","log.logger":"driver","message":"Data migration completed successfully, starting node deletion","service.version":"1.2.1-b5316231","service.type":"eck","ecs.version":"1.4.0","namespace":"elk","statefulset_name":"elasticsearch-logs-es-data-2","node":"elasticsearch-logs-es-data-2-3"}
Could you tell us how you noticed that the data had not been migrated?
Could you tell us how you noticed that the data had not been migrated?
I can try to reproduce if you want.
Thanks, I'll investigate and update the issue with my findings.
The status of some indices went red.
How many replicas are configured for the affected indices?
Also, is the cluster still red, or did it eventually recover?
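The replica count matters because Elasticsearch reports an index as red when at least one primary shard is unassigned, and as yellow when all primaries are assigned but some replica is not. A simplified sketch of what losing a node does to a single index, immediately after the loss and before any recovery, might look like this. The function and the short node names are hypothetical and only model the health rule; they do not call Elasticsearch.

```python
def health_after_node_loss(shards, lost_node):
    """Model index health right after `lost_node` disappears.

    `shards` is a list of shards; each shard is the list of node names
    that hold a copy of it (primary or replica).
    """
    status = "green"
    for nodes in shards:
        surviving = [node for node in nodes if node != lost_node]
        if not surviving:
            return "red"       # no copy left, so the primary cannot be assigned
        if len(surviving) < len(nodes):
            status = "yellow"  # a surviving copy serves as primary; one copy is missing
    return status


# One shard, 0 replicas, its only copy on the lost node.
print(health_after_node_loss([["data-2-3"]], "data-2-3"))
# One shard, 1 replica, with a second copy on another node.
print(health_after_node_loss([["data-2-3", "data-2-1"]], "data-2-3"))
```

Under this model, an index with 0 replicas whose only copy sat on the removed node goes red, while an index with a replica elsewhere only goes yellow.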
I have been able to reproduce. I have opened a new issue (https://github.com/elastic/cloud-on-k8s/issues/3867) as I think that the scope is a bit "broader" than just being able to upgrade and downscale at the same time.
I'm keeping this one open to see whether we can mitigate it while we work on the other one.
How many replicas are configured for the affected indices?
The affected indices had 0 replicas; indices with replicas went yellow but not red.
Also, is the cluster still red, or did it eventually recover?
No, not until we scaled the number of nodes back up. We removed the uid from the claimRef on the PV that held the data (its status went from Released to Available) so that the new PVC, created when scaling up, could bind to this existing PV.
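The recovery step described above can be sketched as a JSON Patch that removes the stale uid from the PV's claimRef. This is only an illustration under assumptions: the PV is assumed to have been kept after the PVC was deleted (hence the Released status), the helper names are hypothetical, the PV name is a placeholder, and the script only prints a `kubectl patch` command rather than running it. Some clusters may also need the claimRef's resourceVersion cleared; the thread only mentions the uid.

```python
import json
import shlex


def release_claim_patch():
    """JSON Patch (RFC 6902) that removes the old PVC uid from a PV's claimRef."""
    return [{"op": "remove", "path": "/spec/claimRef/uid"}]


def kubectl_patch_command(pv_name):
    """Build the kubectl argument list for the patch; nothing is executed here."""
    return [
        "kubectl", "patch", "pv", pv_name,
        "--type", "json",
        "-p", json.dumps(release_claim_patch()),
    ]


# "<pv-name>" is a placeholder for the name of the Released PersistentVolume.
print(shlex.join(kubectl_patch_command("<pv-name>")))
```

Because the claimRef keeps its name and namespace, the volume stays reserved for a PVC with the same name, which is what the StatefulSet recreates when the nodeSet is scaled back up.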
Thanks for digging into this issue! (or related issues)
I'm closing this one, as a fix has been merged in 1.3 (which should be available in a couple of days).
@mtparet Thank you for having raised this issue.