scylla service restart takes 20 minutes after enospc nemesis

Created on 27 Apr 2021  Â·  16Comments  Â·  Source: scylladb/scylla

scenario:

  1. run enospc nemesis to reach 99% used disk capacity on node-6.
  2. then delete all above created files, sleep a bit and restart scylla service.
  3. the timeout command for scylla service restart expired after 8 minutes.
  4. then it took another 12 minutes as a total of 20 minutes for node-6 to complete initialization.

logs:

< t:2021-04-21 09:52:59,448 f:cluster.py      l:1356 c:sdcm.cluster         p:DEBUG > 2021-04-21T09:52:47+00:00  longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 !NOTICE  | sudo: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/bin/systemctl restart scylla-server.service

< t:2021-04-21 10:01:07,485 f:base.py         l:144  c:RemoteCmdRunner      p:ERROR > Error executing command: "sudo systemctl restart scylla-server.service"; Exit status: -1

2021-04-21T10:08:47+00:00  longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 !INFO    | scylla: Scylla version 4.3.3-0.20210419.28f5e0bd2 with build-id c4a1db52eac953329ed2383fc23431f5c9cabf62 starting ...
2021-04-21T10:08:49+00:00  longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 !INFO    | scylla: [shard 0] init - Scylla version 4.3.3-0.20210419.28f5e0bd2 with build-id c4a1db52eac953329ed2383fc23431f5c9cabf62 starting ...
2021-04-21T10:12:02+00:00  longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 !INFO    | scylla: [shard 0] init - Scylla version 4.3.3-0.20210419.28f5e0bd2 initialization completed.

Installation details
Scylla version (or git commit hash): 4.3.3-0.20210419.28f5e0bd2 with build-id c4a1db52eac953329ed2383fc23431f5c9cabf62
Cluster size: 6 nodes (i3.4xlarge)
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-071018db599725edc (aws: eu-north-1)

Test: longevity-cdc-100gb-4h-test
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Issue description

====================================

PUT ISSUE DESCRIPTION HERE

====================================

Restore Monitor Stack command: $ hydra investigate show-monitor 4419244a-6461-4886-af1e-238712f0fd2c
Show all stored logs command: $ hydra investigate show-logs 4419244a-6461-4886-af1e-238712f0fd2c

Test id: 4419244a-6461-4886-af1e-238712f0fd2c

Logs:
prometheus - https://cloudius-jenkins-test.s3.amazonaws.com/4419244a-6461-4886-af1e-238712f0fd2c/prometheus_snapshot_20210421_101534.tar.gz
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/4419244a-6461-4886-af1e-238712f0fd2c/20210421_100525/grafana-screenshot-overview-20210421_100525-longevity-cdc-100gb-4h-4-3-monitor-node-4419244a-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/4419244a-6461-4886-af1e-238712f0fd2c/20210421_100525/grafana-screenshot-scylla-per-server-metrics-nemesis-20210421_100907-longevity-cdc-100gb-4h-4-3-monitor-node-4419244a-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/4419244a-6461-4886-af1e-238712f0fd2c/20210421_101603/grafana-screenshot-overview-20210421_101604-longevity-cdc-100gb-4h-4-3-monitor-node-4419244a-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/4419244a-6461-4886-af1e-238712f0fd2c/20210421_101603/grafana-screenshot-scylla-per-server-metrics-nemesis-20210421_101915-longevity-cdc-100gb-4h-4-3-monitor-node-4419244a-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/4419244a-6461-4886-af1e-238712f0fd2c/20210421_102553/db-cluster-4419244a.zip
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/4419244a-6461-4886-af1e-238712f0fd2c/20210421_102553/loader-set-4419244a.zip
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/4419244a-6461-4886-af1e-238712f0fd2c/20210421_102553/monitor-set-4419244a.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/4419244a-6461-4886-af1e-238712f0fd2c/20210421_102553/sct-runner-4419244a.zip

Jenkins job URL

bug commit-log

All 16 comments

I expect its doing commitlog replay - but you can double check that in the log.

Calle - can you check and see if its commitlog replay that is taking a long time - or something else that is "strange" and do the first analysis.

Only node 6 is doing replay. That is one with a _lot_ of files (including recycled ones, which seem to indicate we are not filtering correctly). But it takes ~2m:

Apr 21 10:08:49 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[66097]:  [shard 0] init - Scylla version 4.3.3-0.20210419.28f5e0bd2 with build-id c4a1db52eac953329ed2383fc23431f5c9cabf62 starting ...

...
Apr 21 10:09:10 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[66097]:  [shard 0] init - replaying commit log

....

Apr 21 10:11:32 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[66097]:  [shard 0] commitlog_replayer - Log replay complete, 56191822 replayed mutations (0 invalid, 0 skipped)
Apr 21 10:11:32 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[66097]:  [shard 0] init - replaying commit log - flushing memtables
Apr 21 10:11:37 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[66097]:  [shard 0] init - replaying commit log - removing old commitlog segments

...
<compaction>
...

Apr 21 10:11:50 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[66097]:  [shard 0] init - Waiting for gossip to settle before accepting client requests...

...
<compaction>
....
Apr 21 10:12:02 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[66097]:  [shard 0] init - Scylla version 4.3.3-0.20210419.28f5e0bd2 initialization completed.

At this point, the rest of the cluster seems to have connected with the restarted node again, and everyone is back to doing compaction. Followed by some compaction.

I can't see anything "strange" as such. The node restarts, it does take almost 4 minutes total, but given the amount of replay and compaction happening before it even gets to gossip, that does not seem wholly unreasonable.

I guess it may be an issue the shutdown ... that takes long and then systemd kills scylla since we have a timeout of 15 minutes if I recall correctly.

Yes systemd kills the service

Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 0] storage_service - Drain on shutdown: tracing is stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 0] storage_service - Drain on shutdown: system distributed keyspace stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 0] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 0] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 0] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 1] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 9] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 4] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 1] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 3] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 4] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 7] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 4] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 5] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 7] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 7] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 8] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 12] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 8] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 12] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 11] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 8] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 11] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 3] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 11] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 6] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 5] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 6] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 5] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 6] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 12] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 2] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 2] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 2] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 1] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 13] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 13] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 13] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 3] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 10] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 10] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 10] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 9] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 9] hints_manager - Asked to stop
Apr 21 09:52:50 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 8] hints_manager - Stopped
Apr 21 09:52:51 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 11] hints_manager - Stopped
Apr 21 09:52:51 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 3] hints_manager - Stopped
Apr 21 09:52:52 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 12] hints_manager - Stopped
Apr 21 09:52:53 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 13] hints_manager - Stopped
Apr 21 09:52:53 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 4] hints_manager - Stopped
Apr 21 09:52:53 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 1] hints_manager - Stopped
Apr 21 09:52:53 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 6] hints_manager - Stopped
Apr 21 09:52:54 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 5] hints_manager - Stopped
Apr 21 09:52:55 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 10] hints_manager - Stopped
Apr 21 09:52:56 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 0] hints_manager - Stopped
Apr 21 09:52:56 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 7] hints_manager - Stopped
Apr 21 10:07:48 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 systemd[1]: scylla-server.service stop-sigterm timed out. Killing.
Apr 21 10:08:46 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 systemd[1]: scylla-server.service: main process exited, code=killed, status=9/KILL

the issue is with the hints manger stop (or so it seems) - shard 9 has two "Asked stop" only a single message "Stop" and all the others have two

Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 9] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 9] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]:  [shard 9] hints_manager - Asked to stop

so we have a case in which shutdown of hints manager seems to be stuck

@piodul / @bhalevy / @xemul rings a bell ?

(the version is 4.3.3)

@slivne , #8079

Thanks

@Piotr Dulikowski @.> / @Carl Wilund *@.*> can
you please confirm we need this backport to 4.3

On Mon, May 10, 2021 at 1:32 PM Pavel Emelyanov @.*>
wrote:

@slivne https://github.com/slivne , #8079
https://github.com/scylladb/scylla/issues/8079

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/8559#issuecomment-836529308,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AA2OCCCYE5QOT3CHNACD5J3TM6Y37ANCNFSM43VOZOGA
.

Sorry, but the hints thing is _before_ node restart (first run shutdown). Replay etc is on the start after.

I somewhat misunderstood the issue somewhat. The concern is that the restart _as a whole_ timed out, not the actual node start?

Yes

And I don;t think the commitlog hard limit is part of 4.3 https://github.com/scylladb/scylla/commit/4082f57edc3e7764e11e9e858a0a8059981c6309

I can't really comment on whether added53b7d614280695d5468c08f401d6833cbab fixes this, but it seems likely, if it is a hints hang.

How the hard limit is not part of 4.3 so why do we need this "soft limit"

On Mon, May 10, 2021 at 1:46 PM Calle Wilund @.*>
wrote:

I can't really comment on whether added53
https://github.com/scylladb/scylla/commit/added53b7d614280695d5468c08f401d6833cbab
fixes this, but it seems likely, if it is a hints hang.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/8559#issuecomment-836541871,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AA2OCCDIKFXP3LGXZOILFTLTM62QLANCNFSM43VOZOGA
.

Ugh, not at my best here today. It is not? Then I am confused.

If the commitlog didn't enforce a hard space limit in 4.3.3, then the patch which makes it soft for hints again won't help.

Maybe it's https://github.com/scylladb/scylla/issues/8577? It also had an ENOSPC nemesis. @bhalevy saw that commitlog shutdown might get stuck if an I/O error happens and sent a PR which should help with it (https://github.com/scylladb/scylla/pull/8578).

Looking at the logs of the node 6 (this issue), there was an I/O error in hints commitlog on shard 9:

2021-04-21T09:52:00+00:00  longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 !ERR     | scylla: [shard 9] commitlog - Failed to persist commits to disk for HintsLog-1-162129586586159264.log: storage_io_error (Storage I/O error: 28: No space left on device)

One hints manager on shard 9 had a no space left on device. Possibly during a hints drain, which is includes a commitlog shutdown call. If the exception was hit here, it is possible this is hitting the problem @bhalevy addressed in #8578.

@piodul - samesis!

Thanks - #8578 is still not merged and this is not a blocker - its not a regression

Was this page helpful?
0 / 5 - 0 ratings