Scylla: mutation_test: fatal error: in "test_apply_monotonically_is_monotonic": Mutations differ

Created on 24 Feb 2021  路  9Comments  路  Source: scylladb/scylla

Seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/next/3221/artifact/testlog/dev/mutation_test.270.log
But reproducible also with master at 25f47561c in dev mode with same random-seed (but not in release for some reason)

Command used:

build/release/test/boost/mutation_test --catch_system_errors=no --color_output=false -- -c2 -m2G --overprovisioned --unsafe-bypass-fsync 1 --blocked-reactor-notify-ms 2000000 --collectd 0 --default-log-level=warn --random-seed=2551724716
$ diff -u mutation_test.{expected,got}.log
--- mutation_test.expected.log  2021-02-24 18:20:16.325801046 +0200
+++ mutation_test.got.log   2021-02-24 18:20:51.070167948 +0200
@@ -1,5 +1,5 @@
-./test/lib/mutation_assertions.hh(108): fatal error: in "test_apply_monotonically_is_monotonic": Mutations differ, expected
-{table: 'ks.cf', key: {'pk': 1806000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, token: -9205109811216406580}, mutation_partition: {
+ ...but got:
+ {table: 'ks.cf', key: {'pk': 1806000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, token: -9205109811216406580}, mutation_partition: {
   tombstone: {tombstone: timestamp=-9223372036854775807, deletion_time=1},
   range_tombstones: {
     {range_tombstone: start={bound: prefix=ckp{00800306000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000}, kind=incl start}, end={bound: prefix=ckp{0080060700000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000806f03000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000}, kind=excl end}, {tombstone: timestamp=-9223372036854775806, deletion_time=0}},
@@ -40,7 +40,7 @@
     's60': { collection {collection_mutation_view tombstone {tombstone: timestamp=-9223372036854775805, deletion_time=0}} },
     's63': { atomic_cell{9202000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000,ts=-9223372036854775806,expiry=-1,ttl=0} },
     's7': { collection {collection_mutation_view tombstone {tombstone: timestamp=-9223372036854775805, deletion_time=0}} },
-    's9': { collection {collection_mutation_view tombstone {tombstone: timestamp=-9223372036854775806, deletion_time=0}00035b60-0000-1000-8080-808080808080: atomic_cell{1806000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000,ts=-9223372036854775805,expiry=-1,ttl=0}, 0010eff0-0000-1000-8080-808080808080: atomic_cell{9202000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000,ts=-9223372036854775805,expiry=-1,ttl=0}, 002ba3e0-0000-1000-8080-808080808080: atomic_cell{9202000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000,ts=-9223372036854775805,expiry=-1,ttl=0}, 00509100-0000-1000-8080-808080808080: atomic_cell{7a05000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000,ts=-9223372036854775805,expiry=-1,ttl=0}, 007222c0-0000-1000-8080-808080808080: atomic_cell{9202000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000,ts=-9223372036854775805,expiry=-1,ttl=0}, 00908030-0000-1000-8080-808080808080: atomic_cell{9202000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000,ts=-9223372036854775805,expiry=-1,ttl=0}} },
+    's9': { collection {collection_mutation_view tombstone {tombstone: timestamp=-9223372036854775806, deletion_time=0}} },
   },
   rows: [
     {

full log:
mutation_test.log.gz

Most helpful comment

This makes the bug go away^W^W^W^W^W makes the test pass:

diff --git a/mutation_partition.cc b/mutation_partition.cc
index 0028fccc8..7c3319802 100644
--- a/mutation_partition.cc
+++ b/mutation_partition.cc
@@ -1170,8 +1170,8 @@ row::apply(const column_definition& column, atomic_cell_or_collection&& value, c
 template<typename Func>
 void row::consume_with(Func&& func) {
     _cells.weed([func, this] (column_id id, cell_and_hash& cah) {
-        _size--;
         func(id, cah);
+        _size--;
         return true;
     });
 }

All 9 comments

Note: There is also https://github.com/scylladb/scylla/issues/8151 that popped up recently.
I don't know if they are related or not.

Hit again in https://jenkins.scylladb.com/job/scylla-enterprise/job/next/1164, https://jenkins.scylladb.com/job/scylla-enterprise/job/next/1166, https://jenkins.scylladb.com/job/scylla-enterprise/job/next/1167.

Bisected to 78d1afeabd0a229933fbdcc69632d60c569a8009. (build/dev/test/boost/mutation_test_g --catch_system_errors=no --color_output=false -t test_apply_monotonically_is_monotonic -- -c2 -m2G --overprovisioned --unsafe-bypass-fsync 1 --blocked-reactor-notify-ms 2000000 --random-seed=3764570480. Of course, it's theoretically possible that the bug is older and just happens for different seeds before this patch, but that's unlikely.)

But reproducible also with master at 25f4756 in dev mode with same random-seed (but not in release for some reason)

It only happens in dev because it happens inside memory::with_allocation_failures and only dev has alloc failure injection enabled by default. When alloc failure injection is manually enabled, the test fails in debug mode as well.

I tried to debug this issue, and noted that the bug occurs when an alloc failure happens while processing the last column of the static row. Unlike earlier columns, the last column will be not processed again when apply_monotonically is retried after an alloc failure. (Rephrased: if the alloc failure happens when processing one of the earlier columns and apply_monotonically is retried, processing will resume from the column that failed. If the alloc failure happens when processing the last column of the static row and apply_monotonically is retried, processing will resume after the static row). I didn't get further because compact_radix_tree::tree is too hard to debug due to all the templates and optimized-out values. But since this always (sample size 2, random seeds 2551724716 and 3764570480) seems to happen to the last column in the static row, it might be some kind of off-by-one.

Thanks @michoecho this is good lead, I was still coming from a compaction angle (based on past experience) but it indeed seems to be something else here.

This makes the bug go away^W^W^W^W^W makes the test pass:

diff --git a/mutation_partition.cc b/mutation_partition.cc
index 0028fccc8..7c3319802 100644
--- a/mutation_partition.cc
+++ b/mutation_partition.cc
@@ -1170,8 +1170,8 @@ row::apply(const column_definition& column, atomic_cell_or_collection&& value, c
 template<typename Func>
 void row::consume_with(Func&& func) {
     _cells.weed([func, this] (column_id id, cell_and_hash& cah) {
-        _size--;
         func(id, cah);
+        _size--;
         return true;
     });
 }

Lkely that's the fix. If func() throws then the consume will be restarted with reduced _size, but entry left in the _cells.

This is only a problem if func() failed while consuming the last cell in the row. When we retry the apply_monotonically() row::empty() returns true and hence this row is not considered for merging again, even though there is one cell left in it.

Regression introduced by f006acc853, not in any release stream.

Was this page helpful?
0 / 5 - 0 ratings