Describe the bug
After/during node reboot (preemptible node in GKE, reboot time is limited to ~30 seconds), ingester pod goes into LEAVING state forever, until the pod is deleted manually.

To Reproduce
Deploy with helm on preempt nodes. Every time node is preempted, one of the ingesters goes to LEAVING and keeps running in this state (/ready return Ready, so pod is marked as healthy and ready).
Expected behavior
Ingester should either heal itself or exit after unclean restart and finding previous state LEAVING
Environment:
Deployment is done from loki helm chart 0.25.1
Loki version string: version=v1.3.0, branch=HEAD, revision=5e527a03
Helm values file:
loki:
extraArgs:
log.level: debug
serviceMonitor:
enabled: true
interval: 10s
scrapeTimeout: 5s
config:
schema_config:
configs:
- from: "2019-07-01"
store: cassandra
object_store: cassandra
schema: v10
index:
prefix: index
period: 360h
chunks:
prefix: chunk
period: 360h
storage_config:
cassandra:
username: cassandra
password: cassandra
addresses: cassandra
auth: true
keyspace: loki
ingester:
lifecycler:
ring:
replication_factor: 2
kvstore:
store: etcd
etcd:
endpoints:
- etcd:2379
nameOverride: loki
fullnameOverride: loki
replicas: 5
resources:
limits:
cpu: 500m
#memory: 256Mi
requests:
cpu: 100m
memory: 128Mi
persistence:
enabled: true
size: 10Gi
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- preference:
matchExpressions:
- key: cloud.google.com/gke-preemptible
operator: Exists
weight: 100
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: loki
topologyKey: "kubernetes.io/hostname"
Promtail(daemonset)/cassandra(sts)/etcd(sts) are deployed separately to the same preempt nodes and experience no issues, except promtail not able to push logs to Loki after 3/5 of them go to LEAVING.
Screenshots, Promtail config, or terminal output
Discussion in slack: https://grafana.slack.com/archives/CEPJRLQNL/p1584242451454400
Logs for starting Loki instance after node reboot:
level=info ts=2020-03-15T02:20:21.386901677Z caller=loki.go:149 msg=initialising module=server
level=info ts=2020-03-15T02:20:21.387793334Z caller=server.go:121 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2020-03-15T02:20:21.388305543Z caller=loki.go:149 msg=initialising module=overrides
level=info ts=2020-03-15T02:20:21.388499931Z caller=override.go:53 msg="per-tenant overrides disabled"
level=info ts=2020-03-15T02:20:21.388729433Z caller=loki.go:149 msg=initialising module=table-manager
level=info ts=2020-03-15T02:20:21.636389303Z caller=loki.go:149 msg=initialising module=store
level=info ts=2020-03-15T02:20:21.639229202Z caller=table_manager.go:220 msg="synching tables" expected_tables=36
...
level=info ts=2020-03-15T02:20:22.067044325Z caller=loki.go:149 msg=initialising module=ring
level=info ts=2020-03-15T02:20:22.067743921Z caller=loki.go:149 msg=initialising module=querier
level=debug ts=2020-03-15T02:20:22.067846777Z caller=modules.go:166 msg="initializing querier worker" config="{Address: Parallelism:10 DNSLookupDuration:10s GRPCClientConfig:{MaxRecvMsgSize:104857600 MaxSendMsgSize:16777216 UseGzipCompression:false RateLimit:0 RateLimitBurst:0 BackoffOnRatelimits:false BackoffConfig:{MinBackoff:100ms MaxBackoff:10s MaxRetries:10}}}"
level=info ts=2020-03-15T02:20:22.067909526Z caller=worker.go:72 msg="no address specified, not starting worker"
level=info ts=2020-03-15T02:20:22.086818825Z caller=loki.go:149 msg=initialising module=ingester
level=info ts=2020-03-15T02:20:22.087478269Z caller=loki.go:149 msg=initialising module=distributor
level=info ts=2020-03-15T02:20:22.087612976Z caller=loki.go:149 msg=initialising module=all
level=info ts=2020-03-15T02:20:22.087631552Z caller=main.go:70 msg="Starting Loki" version="(version=v1.3.0, branch=HEAD, revision=5e527a03)"
level=info ts=2020-03-15T02:20:22.089785918Z caller=lifecycler.go:471 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2020-03-15T02:20:22.100261899Z caller=lifecycler.go:505 msg="existing entry found in ring" state=LEAVING tokens=128 ring=ingester
level=debug ts=2020-03-15T02:20:22.100369162Z caller=lifecycler.go:359 msg="JoinAfter expired" ring=ingester
level=debug ts=2020-03-15T02:20:26.168712738Z caller=logging.go:44 traceID=17132fd9b604a73b msg="GET /metrics (200) 27.310659ms"
level=debug ts=2020-03-15T02:20:32.345468852Z caller=logging.go:44 traceID=74f88119023ce136 msg="GET /metrics (200) 9.792214ms"
Problem might come from cortex not having a path out from LEAVING state on startup: https://github.com/cortexproject/cortex/blob/3374e33bde0e805f9955fc8ba4caf0a4b96e04ca/pkg/ring/lifecycler.go#L363
So this is something we do experience from time to time. The workaround for now is to forget the leaving ingester.
I want to also point out that the helm chart wasn't meant to be scaled, for that we prefer to recommend tanka production/ksonnet.
Also let's ask our ring expert @pstibrany.
Are you running ingesters on those preemptible nodes? From what I'm reading, I would strongly recommend not doing that. Ingesters take a while to shutdown correctly (during rollout of new version, new ingester is starting while another one is stopping, and stopping ingester then tries to handover its data to starting one). You can disable handover by using -ingester.max-transfer-retries=0 (or max_transfer_retries field under ingester in YAML). This will tell ingester to flush data to storage only. Note that it can still take a while.
If ingester is left in ring in LEAVING state, it means that it has failed to perform its shutdown correctly – and then it's best to find out why. When running on preemptible nodes, it is very likely that ingester has been killed before it was able to handover data or flush it properly, since pods on preemptible are not shutdown gracefully. This is bad, and can lead to data loss.
@pstibrany yep, it's preemptible pool. But:
Anyway, skipping arguing about preempt pros and cons, do you think staying in LEAVING forever is a normal behavior? If I delete the pod which is in LEAVING, it shuts down gracefully and goes ACTIVE after restart, why it does not shut itself down after some meaningful timeout while in LEAVING? I don't see the need in manual action here.
In other words, I'm OK with partial data loss (logs from single node due to unexpected shutdown) - that's why I use preempt nodes. I'm not OK with application requiring manual restart to heal - data loss already happened, why wait further?
So this is something we do experience from time to time. The workaround for now is to forget the leaving ingester.
Forget does not work for me, node just reappears LEAVING in a matter of seconds
If ingester is OOMkilled, that's bad and can result in loss of data. That's also why we use replication. Other Loki components are not stateful (well, query frontends keep queue of queries, but if you lose queries, it's not a big deal). Ingesters keep lot of data in memory, and need proper shutdown.
It's interesting that you point out that ingester reappears with LEAVING State... it means it's running. I think that's because when new ingester started, it simply picked up state that was in the ring (LEAVING) and is now running with this state. You can confirm by checking for existing entry found in ring entry in the log.
I think best what you can do is to 1) stop this ingester, 2) forget it from the ring if it is still there after stopping, 3) restart it agan.
In upcoming Loki release, this is handled automatically -- if ingester finds entry in LEAVING state and number of tokens matches what it is configured with, it will switch the state to ACTIVE instead.
it means it's running
It is. The thing is that I'm using 'all in one' image, not sure how it'll work in case of single ingester process.
In upcoming Loki release, this is handled automatically -- if ingester finds entry in LEAVING state and number of tokens matches what it is configured with, it will switch the state to ACTIVE instead.
@pstibrany thanks Peter! Looking for it.
For the reference, here is relevant change: https://github.com/cortexproject/cortex/pull/1921.
In upcoming Loki release, this is handled automatically -- if ingester finds entry in LEAVING state and number of tokens matches what it is configured with, it will switch the state to ACTIVE instead.
Sounds good, any date for the next release?
Sounds good, any date for the next release?
real soon now, I think 😉
/cc @slim-bean
UPDATE: unfortunately, I was testing it on non-preemptible nodes by mistake. On preemptible it keeps failing even with -ingester.max-transfer-retries=0 :( I'm moving this to non-preemptible nodes.
@pstibrany, I'm testing below w/a for a few days now, looks like it works - I haven't seen ingester in the LEAVING state since changing configuration. Looks like 30 seconds (preemption time) is enough for ingester to flush data to storage.
You can disable handover by using
-ingester.max-transfer-retries=0(ormax_transfer_retriesfield underingesterin YAML). This will tell ingester to flush data to storage only. Note that it can still take a while.
This behavior can also be forced by sending a SIGKILL or deleting pods with grace-period=0.
Maybe it makes sense to test this in a chaoskube way?
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
The problem is still here (non-preemptible machine isn't a fix)
So I would like to add something regarding the SIGTERM / SIGKILL with GKE preemptible Nodes. The explanation done by the docs about preemptibles is for "normal" instances and not for workers.
It seems GKE is not picking up this signal to terminate pods gradually. So what happens is that the pod is instantly killed, which makes -ingester.max-transfer-retries=0 unusable. There is no time for any retry or whatsoever.
To be able to notify your pods of termination, you will need to following Daemonset running on your preemtibles.
I just installed it and didn't do any tests yet, but just wanted to inform you 😄
I've experiment k8s-node-termination-handler, it's better but still need some tests (and time) for ensure that ""solve"" the problem
Just fews information for make this "working"
--system-pod-grace-period=0s) -> give 30sec for graceful shutdownserver.graceful_shutdown_timeout=5s, ingester.lifecycler.final_sleep=0s and ingester.max_transfer_retries=20max_transfer_retries > 20 ? Because is faster to transfer chunk to the new ingester than to flush to the store (on my side, it was impossible to flush in less than 30sec)final_sleep -> care here, if you expose prometheus metrics you can lost itFIY, when SIGTERM happens ingester do (in this order)
ingester.lifecycler.final_sleep secondBut it's important to mention that work only in best case scenario, if node crash or shutdown is longer than 30sec -> ingester will be stuck in unhealthy state
EDIT: in the last 1h, 4 ingesters was stuck in unhealthy state, I guess this isn't a viable solution ...
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
Most helpful comment
real soon now, I think 😉
/cc @slim-bean