Scylla: test_bank_with_nemesis times out with new I/O scheduler

Created on 20 Dec 2020 · 35Comments · Source: scylladb/scylla

Regression seen since https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/691/testReport/lwt_banking_load_test/LWTBankingLoadTest/test_bank_with_nemesis/
Bisected down to 18be57a4e5b7c3df6b51d44a68c144e46f74a8f7
that includes scylladb/seastar@800f87a4398c5e4a529f00c872ab3e1aa548d96c

There are many thousands of timeout reports in the logs, for example:
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/691/artifact/logs-release.2/orphaned/1607964940_lwt_banking_load_test.LWTBankingLoadTest.test_bank_with_nemesis/node1.log

INFO  2020-12-14 17:01:06,208 [shard 1] rpc - client 127.0.0.1:49529 msg_id 683:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
INFO  2020-12-14 17:01:06,217 [shard 1] rpc - client 127.0.0.1:59041 msg_id 1200:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
INFO  2020-12-14 17:01:06,234 [shard 1] rpc - client 127.0.0.1:50945 msg_id 916:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
INFO  2020-12-14 17:01:06,234 [shard 1] rpc - client 127.0.0.1:64273 msg_id 820:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
INFO  2020-12-14 17:01:06,234 [shard 1] rpc - client 127.0.0.1:64273 msg_id 816:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
INFO  2020-12-14 17:01:06,234 [shard 1] rpc - client 127.0.0.1:50945 msg_id 920:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
WARN  2020-12-14 17:01:06,234 [shard 1] seastar - Exceptional future ignored: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.), backtrace:    0x3c5047e
   0x3c508f0
   0x3c50c78
   0x387bef7
   0x1ebda0f
   0x38c6b8f
   0x38c7d77
   0x38e6448
   0x38924da
   /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/c60da2e90da1c5bc58a7799bd67220b9237d13cc/scylla/libreloc/libpthread.so.0+0x93f8
   /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/c60da2e90da1c5bc58a7799bd67220b9237d13cc/scylla/libreloc/libc.so.6+0x101b02
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<service::paxos_response_handler::prune(utils::UUID)::$_7, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<service::paxos_response_handler::prune(utils::UUID)::$_7, false> >(seastar::future<void>::finally_body<service::paxos_response_handler::prune(utils::UUID)::$_7, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<service::paxos_response_handler::prune(utils::UUID)::$_7, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>

bug high

Source

bhalevy

Most helpful comment

@bhalevy I ran git blame again and finally found something I was looking for and somehow missed earlier (or maybe the last time I looked it wasn't there). 51037e94ec fixes this issue. prune used to not have a handler that catches all kinds of exceptions in case of failure, which resulted in Exceptional future ignored in the logs.

psarna on 5 Mar 2021

👍2

All 35 comments

@xemul please look into this issue.
It is highly reproducible.

bhalevy on 20 Dec 2020

@bhalevy , my bisect pointed to the 27fba358 commit. Can you please re-check on your side?

xemul on 21 Dec 2020

@psarna any idea how the hell this is related ?

slivne on 22 Dec 2020

The error type for local timeouts changed from write failure to write timeout, so perhaps one kind is correctly caught with try-catch, and the other isn't? I'll try to dig more deeply and find out.

psarna on 22 Dec 2020

Maybe htis can help -- the test in question generates tons of expected injected_error-s from service/paxos/paxos_state.cc paxos_state::prepare().

xemul on 22 Dec 2020

Maybe htis can help -- the test in question generates tons of expected injected_error-s from service/paxos/paxos_state.cc paxos_state::prepare().

@xemul but error injection is disabled in debug build, isn't it?
Where did you see these tons of expected injected errors? Did you try to reproduce in dev mode?

bhalevy on 23 Dec 2020

It was dev mode, yes. @bhalevy

xemul on 23 Dec 2020

ok, so they are not relevant to the timeouts seen in dtest-release mode

bhalevy on 23 Dec 2020

Can I get a clarification on the bisection? I got lost a little. The original description says 18be57a, later it was bisected to 27fba35 - but is this second bisect valid, or was it a red herring due to running in dev mode? @bhalevy @xemul

psarna on 23 Dec 2020

@psarna , the 18be57a is the commit updating seastar, that brought new IO scheduler into scylla.
_After_ it there came the 39afe14a, which is the merge commit of your 'Add per query timeout' branch (carrying the 27fba35 on board).

I did 4 tests on local dev-build:

On former commit -- several timeouts in logs (I saw ~10 timeouts on every "good" run)
On latter commit -- thousands (literally 2000 and more) of timeouts in logs
On latter commit with new IO scheduler patched-out by hands -- thousands timeouts in logs again
Unbiased git-bisect between those two, checking for the number of timeouts in logs to be higher than 100 -- it lead me to the 27fba35 commit

xemul on 23 Dec 2020

@psarna ping

slivne on 1 Mar 2021

On it. I need to pinpoint the place where the future which adds stuff to paxos tables is ignored, and then it will be easier to investigate why the number of warnings increased.

psarna on 1 Mar 2021

Um, can somebody confirm this issue still happens? I just tested on master (release mode) and encountered 0 timeouts.

psarna on 1 Mar 2021

Maybe it's hardware dependent.

avikivity on 1 Mar 2021

If possible, it would help very much to run the reproducer (if somebody has a working one) with trace logging level for paxos, in order to see which operation causes the timeouts. One of the operations that is left in the background must not handle the timeout error properly, but there's many of them in ~~paxos_state.cc~~storage_proxy.cc alone and I'm not able to easily guess which one causes trouble.

psarna on 1 Mar 2021

I'll try to repeat this on my node a bit later and report back.

xemul on 1 Mar 2021

@psarna , I've reproduced it locally on scylla mastera & scylla-dtest commit d1041a56, here are the logs with --logger-log-level paxos=trace option. But it's 927M and containt 8000523 messages from paxos. If there's anything else I can collect for you, let me know.

https://drive.google.com/file/d/1nO7Y4vaGm70GQIxzdbss35yha9uu4mka/view?usp=sharing

xemul on 1 Mar 2021

@psarna ping

slivne on 2 Mar 2021

@psarna ping

slivne on 4 Mar 2021

on iiit, somehow I missed the github notifications

psarna on 4 Mar 2021

@xemul I sent a request for accessing the file

psarna on 4 Mar 2021

Ok, the errors come from pruning (and I should have found this information in the issue description... I missed the last line). Now I'll try to figure out why the logs are spammed after this timeout actually gets propagated as a timeout instead of a failure.

psarna on 4 Mar 2021

Btw, @xemul, was it release mode? I see WARN 2021-03-01 19:45:18,472 [shard 1] storage_proxy - Failed to apply mutation from 127.0.67.3#1: utils::injected_error (injected_error_before_learn) in the logs, which indicates that the error injection is active?
The logs are probably still helpful, since some of the timeouts are not caused by error injections, I just want to make sure what's the environment I'm working on here.

psarna on 4 Mar 2021

I'm going to need trace logging for storage_proxy logger as well. I'm going to try locally first, and if I fail I'll come back here begging for more logs.

psarna on 4 Mar 2021

@psarna , it was dev build. But AFAIU error injection is not dev-specific. It's memory allocation failure injection that is dev-only.

xemul on 4 Mar 2021

Yeah, error injection is not dev-specific, but lack of it is release-specific - the problem I encountered when trying to reproduce locally is that a release build (the same one as in the issue description) showed no flood of timeout messages.

psarna on 4 Mar 2021

Release mode succeeds without a single timeout for me, every time. The trace from the issue description points to prune, but I don't see any unhandled exceptions in there... I'll try some more code inspection tomorrow, but it makes it way tougher that the problem does not reproduce for me locally.

psarna on 4 Mar 2021

I gave it some thought and I think that the main problem is that local paxos timeouts are always printed to the logs. There's a special path for local writes (regular ones, not paxos), in which no message gets printed to the log, precisely to avoid spamming the logs. It's even commented in the code that it's the purpose. It's done by catching a timed_out error.
But, a local paxos write is performed by issuing a local mutation via storage proxy, and if it fails, then it shows as a mutation_write_timeout_exception, not as a timed_out error, which means that it doesn't get into the special path and the error gets printed to the logs.

I'll try verifying this tomorrow (not sure how yet, given that the issue does not reproduce with release mode), but I think that it might be the root cause.

psarna on 4 Mar 2021

that doesn't explain the unhandled exceptional future from the issue description though, and I'm not able to reproduce that at all, regardless of the mode - dev or release.

psarna on 4 Mar 2021

OK, so at commit d6a94a7d dev-build still times out, release-build passes on my local node

xemul on 4 Mar 2021

psarna on 5 Mar 2021

👍2

Upd. I've lost the reproducer :( After upgrade of dtests (and lots of python libs, but maybe it's not related) the test just passed on dev build, though the logs are still full of timeouts and exceptions.

xemul on 5 Mar 2021

@bhalevy - do we need a backport to 4.4 - can I close this ?

slivne on 9 Mar 2021

We can backport 51037e94ecb53c20f4b7d394fe1c0c3b16b0479f to 4.5 but it's really cosmetic.

That said, we stopped running it in dtest-release in scylladb/scylla-dtest@754f57db128bd4c78a967a6d8f5ebde7d7145a6c (by @alecco), but it's not running in dtest-debug either since the latter uses -a dtest-debug and the tests is not tagged properly.

I suggest we first start running it again and make sure it passes, then close the issue.

bhalevy on 10 Mar 2021

@alecco / @xemul the test times out in debug mode after 2 hours.
Please see https://jenkins.scylladb.com/job/scylla-master/job/dtest-debug/743/artifact/logs-debug.2/orphaned/1615594847_lwt_banking_load_test.LWTBankingLoadTest.test_bank_with_nemesis/node1.log

bhalevy on 14 Mar 2021

Was this page helpful?

0 / 5 - 0 ratings

Related issues

scylla-ami-setup: I/O Scheduler is not properly configured

amoskong · 6Comments

How do I query the number of items in a table that contains many, many, many pieces of data?

Felix-zhoux · 3Comments

Add a rest API to get the current compaction backlog for a particular table/KS/all

vladzcloudius · 5Comments

Unable to connect to Scylla via port 9042 (and others) on Docker overlay network, and the issue seems to be specific to Scylla.

mattcobley · 6Comments

ScyllaDB combines with SPDK

hellowaywewe · 3Comments