Scylla: test_bank_with_nemesis times out with new I/O scheduler

Created on 20 Dec 2020  路  35Comments  路  Source: scylladb/scylla

Regression seen since https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/691/testReport/lwt_banking_load_test/LWTBankingLoadTest/test_bank_with_nemesis/
Bisected down to 18be57a4e5b7c3df6b51d44a68c144e46f74a8f7
that includes scylladb/seastar@800f87a4398c5e4a529f00c872ab3e1aa548d96c

There are many thousands of timeout reports in the logs, for example:
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/691/artifact/logs-release.2/orphaned/1607964940_lwt_banking_load_test.LWTBankingLoadTest.test_bank_with_nemesis/node1.log

INFO  2020-12-14 17:01:06,208 [shard 1] rpc - client 127.0.0.1:49529 msg_id 683:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
INFO  2020-12-14 17:01:06,217 [shard 1] rpc - client 127.0.0.1:59041 msg_id 1200:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
INFO  2020-12-14 17:01:06,234 [shard 1] rpc - client 127.0.0.1:50945 msg_id 916:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
INFO  2020-12-14 17:01:06,234 [shard 1] rpc - client 127.0.0.1:64273 msg_id 820:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
INFO  2020-12-14 17:01:06,234 [shard 1] rpc - client 127.0.0.1:64273 msg_id 816:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
INFO  2020-12-14 17:01:06,234 [shard 1] rpc - client 127.0.0.1:50945 msg_id 920:  exception "Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE." in no_wait handler ignored
WARN  2020-12-14 17:01:06,234 [shard 1] seastar - Exceptional future ignored: exceptions::mutation_write_timeout_exception (Operation timed out for system.paxos - received only 0 responses from 1 CL=ONE.), backtrace:    0x3c5047e
   0x3c508f0
   0x3c50c78
   0x387bef7
   0x1ebda0f
   0x38c6b8f
   0x38c7d77
   0x38e6448
   0x38924da
   /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/c60da2e90da1c5bc58a7799bd67220b9237d13cc/scylla/libreloc/libpthread.so.0+0x93f8
   /jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/c60da2e90da1c5bc58a7799bd67220b9237d13cc/scylla/libreloc/libc.so.6+0x101b02
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<service::paxos_response_handler::prune(utils::UUID)::$_7, false>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void>::finally_body<service::paxos_response_handler::prune(utils::UUID)::$_7, false> >(seastar::future<void>::finally_body<service::paxos_response_handler::prune(utils::UUID)::$_7, false>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::future<void>::finally_body<service::paxos_response_handler::prune(utils::UUID)::$_7, false>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
bug high

Most helpful comment

@bhalevy I ran git blame again and finally found something I was looking for and somehow missed earlier (or maybe the last time I looked it wasn't there). 51037e94ec fixes this issue. prune used to not have a handler that catches all kinds of exceptions in case of failure, which resulted in Exceptional future ignored in the logs.

All 35 comments

@xemul please look into this issue.
It is highly reproducible.

@bhalevy , my bisect pointed to the 27fba358 commit. Can you please re-check on your side?

@psarna any idea how the hell this is related ?

The error type for local timeouts changed from write failure to write timeout, so perhaps one kind is correctly caught with try-catch, and the other isn't? I'll try to dig more deeply and find out.

Maybe htis can help -- the test in question generates tons of expected injected_error-s from service/paxos/paxos_state.cc paxos_state::prepare().

Maybe htis can help -- the test in question generates tons of expected injected_error-s from service/paxos/paxos_state.cc paxos_state::prepare().

@xemul but error injection is disabled in debug build, isn't it?
Where did you see these tons of expected injected errors? Did you try to reproduce in dev mode?

It was dev mode, yes. @bhalevy

ok, so they are not relevant to the timeouts seen in dtest-release mode

Can I get a clarification on the bisection? I got lost a little. The original description says 18be57a, later it was bisected to 27fba35 - but is this second bisect valid, or was it a red herring due to running in dev mode? @bhalevy @xemul

@psarna , the 18be57a is the commit updating seastar, that brought new IO scheduler into scylla.
_After_ it there came the 39afe14a, which is the merge commit of your 'Add per query timeout' branch (carrying the 27fba35 on board).

I did 4 tests on local dev-build:

On former commit -- several timeouts in logs (I saw ~10 timeouts on every "good" run)
On latter commit -- thousands (literally 2000 and more) of timeouts in logs
On latter commit with new IO scheduler patched-out by hands -- thousands timeouts in logs again
Unbiased git-bisect between those two, checking for the number of timeouts in logs to be higher than 100 -- it lead me to the 27fba35 commit

@psarna ping

On it. I need to pinpoint the place where the future which adds stuff to paxos tables is ignored, and then it will be easier to investigate why the number of warnings increased.

Um, can somebody confirm this issue still happens? I just tested on master (release mode) and encountered 0 timeouts.

Maybe it's hardware dependent.

If possible, it would help very much to run the reproducer (if somebody has a working one) with trace logging level for paxos, in order to see which operation causes the timeouts. One of the operations that is left in the background must not handle the timeout error properly, but there's many of them in paxos_state.ccstorage_proxy.cc alone and I'm not able to easily guess which one causes trouble.

I'll try to repeat this on my node a bit later and report back.

@psarna , I've reproduced it locally on scylla mastera & scylla-dtest commit d1041a56, here are the logs with --logger-log-level paxos=trace option. But it's 927M and containt 8000523 messages from paxos. If there's anything else I can collect for you, let me know.

https://drive.google.com/file/d/1nO7Y4vaGm70GQIxzdbss35yha9uu4mka/view?usp=sharing

@psarna ping

@psarna ping

on iiit, somehow I missed the github notifications

@xemul I sent a request for accessing the file

Ok, the errors come from pruning (and I should have found this information in the issue description... I missed the last line). Now I'll try to figure out why the logs are spammed after this timeout actually gets propagated as a timeout instead of a failure.

Btw, @xemul, was it release mode? I see WARN 2021-03-01 19:45:18,472 [shard 1] storage_proxy - Failed to apply mutation from 127.0.67.3#1: utils::injected_error (injected_error_before_learn) in the logs, which indicates that the error injection is active?
The logs are probably still helpful, since some of the timeouts are not caused by error injections, I just want to make sure what's the environment I'm working on here.

I'm going to need trace logging for storage_proxy logger as well. I'm going to try locally first, and if I fail I'll come back here begging for more logs.

@psarna , it was dev build. But AFAIU error injection is not dev-specific. It's memory allocation failure injection that is dev-only.

Yeah, error injection is not dev-specific, but lack of it is release-specific - the problem I encountered when trying to reproduce locally is that a release build (the same one as in the issue description) showed no flood of timeout messages.

Release mode succeeds without a single timeout for me, every time. The trace from the issue description points to prune, but I don't see any unhandled exceptions in there... I'll try some more code inspection tomorrow, but it makes it way tougher that the problem does not reproduce for me locally.

I gave it some thought and I think that the main problem is that local paxos timeouts are always printed to the logs. There's a special path for local writes (regular ones, not paxos), in which no message gets printed to the log, precisely to avoid spamming the logs. It's even commented in the code that it's the purpose. It's done by catching a timed_out error.
But, a local paxos write is performed by issuing a local mutation via storage proxy, and if it fails, then it shows as a mutation_write_timeout_exception, not as a timed_out error, which means that it doesn't get into the special path and the error gets printed to the logs.

I'll try verifying this tomorrow (not sure how yet, given that the issue does not reproduce with release mode), but I think that it might be the root cause.

  • that doesn't explain the unhandled exceptional future from the issue description though, and I'm not able to reproduce that at all, regardless of the mode - dev or release.

OK, so at commit d6a94a7d dev-build still times out, release-build passes on my local node

@bhalevy I ran git blame again and finally found something I was looking for and somehow missed earlier (or maybe the last time I looked it wasn't there). 51037e94ec fixes this issue. prune used to not have a handler that catches all kinds of exceptions in case of failure, which resulted in Exceptional future ignored in the logs.

Upd. I've lost the reproducer :( After upgrade of dtests (and lots of python libs, but maybe it's not related) the test just passed on dev build, though the logs are still full of timeouts and exceptions.

@bhalevy - do we need a backport to 4.4 - can I close this ?

We can backport 51037e94ecb53c20f4b7d394fe1c0c3b16b0479f to 4.5 but it's really cosmetic.

That said, we stopped running it in dtest-release in scylladb/scylla-dtest@754f57db128bd4c78a967a6d8f5ebde7d7145a6c (by @alecco), but it's not running in dtest-debug either since the latter uses -a dtest-debug and the tests is not tagged properly.

I suggest we first start running it again and make sure it passes, then close the issue.

Was this page helpful?
0 / 5 - 0 ratings