scenario:
logs:
< t:2021-04-21 09:52:59,448 f:cluster.py l:1356 c:sdcm.cluster p:DEBUG > 2021-04-21T09:52:47+00:00 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 !NOTICE | sudo: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/bin/systemctl restart scylla-server.service
< t:2021-04-21 10:01:07,485 f:base.py l:144 c:RemoteCmdRunner p:ERROR > Error executing command: "sudo systemctl restart scylla-server.service"; Exit status: -1
2021-04-21T10:08:47+00:00 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 !INFO | scylla: Scylla version 4.3.3-0.20210419.28f5e0bd2 with build-id c4a1db52eac953329ed2383fc23431f5c9cabf62 starting ...
2021-04-21T10:08:49+00:00 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 !INFO | scylla: [shard 0] init - Scylla version 4.3.3-0.20210419.28f5e0bd2 with build-id c4a1db52eac953329ed2383fc23431f5c9cabf62 starting ...
2021-04-21T10:12:02+00:00 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 !INFO | scylla: [shard 0] init - Scylla version 4.3.3-0.20210419.28f5e0bd2 initialization completed.
Installation details
Scylla version (or git commit hash): 4.3.3-0.20210419.28f5e0bd2 with build-id c4a1db52eac953329ed2383fc23431f5c9cabf62
Cluster size: 6 nodes (i3.4xlarge)
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-071018db599725edc (aws: eu-north-1)
Test: longevity-cdc-100gb-4h-test
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Issue description
====================================
PUT ISSUE DESCRIPTION HERE
====================================
Restore Monitor Stack command: $ hydra investigate show-monitor 4419244a-6461-4886-af1e-238712f0fd2c
Show all stored logs command: $ hydra investigate show-logs 4419244a-6461-4886-af1e-238712f0fd2c
Test id: 4419244a-6461-4886-af1e-238712f0fd2c
Logs:
prometheus - https://cloudius-jenkins-test.s3.amazonaws.com/4419244a-6461-4886-af1e-238712f0fd2c/prometheus_snapshot_20210421_101534.tar.gz
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/4419244a-6461-4886-af1e-238712f0fd2c/20210421_100525/grafana-screenshot-overview-20210421_100525-longevity-cdc-100gb-4h-4-3-monitor-node-4419244a-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/4419244a-6461-4886-af1e-238712f0fd2c/20210421_100525/grafana-screenshot-scylla-per-server-metrics-nemesis-20210421_100907-longevity-cdc-100gb-4h-4-3-monitor-node-4419244a-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/4419244a-6461-4886-af1e-238712f0fd2c/20210421_101603/grafana-screenshot-overview-20210421_101604-longevity-cdc-100gb-4h-4-3-monitor-node-4419244a-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/4419244a-6461-4886-af1e-238712f0fd2c/20210421_101603/grafana-screenshot-scylla-per-server-metrics-nemesis-20210421_101915-longevity-cdc-100gb-4h-4-3-monitor-node-4419244a-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/4419244a-6461-4886-af1e-238712f0fd2c/20210421_102553/db-cluster-4419244a.zip
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/4419244a-6461-4886-af1e-238712f0fd2c/20210421_102553/loader-set-4419244a.zip
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/4419244a-6461-4886-af1e-238712f0fd2c/20210421_102553/monitor-set-4419244a.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/4419244a-6461-4886-af1e-238712f0fd2c/20210421_102553/sct-runner-4419244a.zip
I expect its doing commitlog replay - but you can double check that in the log.
Calle - can you check and see if its commitlog replay that is taking a long time - or something else that is "strange" and do the first analysis.
Only node 6 is doing replay. That is one with a _lot_ of files (including recycled ones, which seem to indicate we are not filtering correctly). But it takes ~2m:
Apr 21 10:08:49 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[66097]: [shard 0] init - Scylla version 4.3.3-0.20210419.28f5e0bd2 with build-id c4a1db52eac953329ed2383fc23431f5c9cabf62 starting ...
...
Apr 21 10:09:10 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[66097]: [shard 0] init - replaying commit log
....
Apr 21 10:11:32 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[66097]: [shard 0] commitlog_replayer - Log replay complete, 56191822 replayed mutations (0 invalid, 0 skipped)
Apr 21 10:11:32 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[66097]: [shard 0] init - replaying commit log - flushing memtables
Apr 21 10:11:37 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[66097]: [shard 0] init - replaying commit log - removing old commitlog segments
...
<compaction>
...
Apr 21 10:11:50 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[66097]: [shard 0] init - Waiting for gossip to settle before accepting client requests...
...
<compaction>
....
Apr 21 10:12:02 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[66097]: [shard 0] init - Scylla version 4.3.3-0.20210419.28f5e0bd2 initialization completed.
At this point, the rest of the cluster seems to have connected with the restarted node again, and everyone is back to doing compaction. Followed by some compaction.
I can't see anything "strange" as such. The node restarts, it does take almost 4 minutes total, but given the amount of replay and compaction happening before it even gets to gossip, that does not seem wholly unreasonable.
I guess it may be an issue the shutdown ... that takes long and then systemd kills scylla since we have a timeout of 15 minutes if I recall correctly.
Yes systemd kills the service
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 0] storage_service - Drain on shutdown: tracing is stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 0] storage_service - Drain on shutdown: system distributed keyspace stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 0] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 0] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 0] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 1] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 9] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 4] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 1] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 3] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 4] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 7] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 4] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 5] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 7] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 7] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 8] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 12] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 8] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 12] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 11] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 8] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 11] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 3] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 11] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 6] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 5] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 6] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 5] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 6] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 12] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 2] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 2] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 2] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 1] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 13] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 13] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 13] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 3] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 10] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 10] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 10] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 9] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 9] hints_manager - Asked to stop
Apr 21 09:52:50 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 8] hints_manager - Stopped
Apr 21 09:52:51 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 11] hints_manager - Stopped
Apr 21 09:52:51 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 3] hints_manager - Stopped
Apr 21 09:52:52 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 12] hints_manager - Stopped
Apr 21 09:52:53 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 13] hints_manager - Stopped
Apr 21 09:52:53 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 4] hints_manager - Stopped
Apr 21 09:52:53 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 1] hints_manager - Stopped
Apr 21 09:52:53 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 6] hints_manager - Stopped
Apr 21 09:52:54 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 5] hints_manager - Stopped
Apr 21 09:52:55 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 10] hints_manager - Stopped
Apr 21 09:52:56 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 0] hints_manager - Stopped
Apr 21 09:52:56 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 7] hints_manager - Stopped
Apr 21 10:07:48 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 systemd[1]: scylla-server.service stop-sigterm timed out. Killing.
Apr 21 10:08:46 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 systemd[1]: scylla-server.service: main process exited, code=killed, status=9/KILL
the issue is with the hints manger stop (or so it seems) - shard 9 has two "Asked stop" only a single message "Stop" and all the others have two
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 9] hints_manager - Asked to stop
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 9] hints_manager - Stopped
Apr 21 09:52:47 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 scylla[6500]: [shard 9] hints_manager - Asked to stop
so we have a case in which shutdown of hints manager seems to be stuck
@piodul / @bhalevy / @xemul rings a bell ?
(the version is 4.3.3)
@slivne , #8079
Thanks
@Piotr Dulikowski @.> / @Carl Wilund *@.*> can
you please confirm we need this backport to 4.3
On Mon, May 10, 2021 at 1:32 PM Pavel Emelyanov @.*>
wrote:
@slivne https://github.com/slivne , #8079
https://github.com/scylladb/scylla/issues/8079—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/8559#issuecomment-836529308,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AA2OCCCYE5QOT3CHNACD5J3TM6Y37ANCNFSM43VOZOGA
.
Sorry, but the hints thing is _before_ node restart (first run shutdown). Replay etc is on the start after.
I somewhat misunderstood the issue somewhat. The concern is that the restart _as a whole_ timed out, not the actual node start?
Yes
And I don;t think the commitlog hard limit is part of 4.3 https://github.com/scylladb/scylla/commit/4082f57edc3e7764e11e9e858a0a8059981c6309
I can't really comment on whether added53b7d614280695d5468c08f401d6833cbab fixes this, but it seems likely, if it is a hints hang.
How the hard limit is not part of 4.3 so why do we need this "soft limit"
On Mon, May 10, 2021 at 1:46 PM Calle Wilund @.*>
wrote:
I can't really comment on whether added53
https://github.com/scylladb/scylla/commit/added53b7d614280695d5468c08f401d6833cbab
fixes this, but it seems likely, if it is a hints hang.—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/8559#issuecomment-836541871,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AA2OCCDIKFXP3LGXZOILFTLTM62QLANCNFSM43VOZOGA
.
Ugh, not at my best here today. It is not? Then I am confused.
If the commitlog didn't enforce a hard space limit in 4.3.3, then the patch which makes it soft for hints again won't help.
Maybe it's https://github.com/scylladb/scylla/issues/8577? It also had an ENOSPC nemesis. @bhalevy saw that commitlog shutdown might get stuck if an I/O error happens and sent a PR which should help with it (https://github.com/scylladb/scylla/pull/8578).
Looking at the logs of the node 6 (this issue), there was an I/O error in hints commitlog on shard 9:
2021-04-21T09:52:00+00:00 longevity-cdc-100gb-4h-4-3-db-node-4419244a-6 !ERR | scylla: [shard 9] commitlog - Failed to persist commits to disk for HintsLog-1-162129586586159264.log: storage_io_error (Storage I/O error: 28: No space left on device)
One hints manager on shard 9 had a no space left on device. Possibly during a hints drain, which is includes a commitlog shutdown call. If the exception was hit here, it is possible this is hitting the problem @bhalevy addressed in #8578.
@piodul - samesis!
Thanks - #8578 is still not merged and this is not a blocker - its not a regression