Scylla: evictable_reader: out-of-range range tombstones emitted on reader recreation cause fragment stream monotonicity violations

Created on 9 Sep 2020 · 53 Comments · Source: scylladb/scylla

Installation details
Scylla version (or git commit hash): 4.2.rc4-0.20200907.bf0c493c28 with build-id 49cc32b091e40433f7006467b50f2e6722d6074b
Cluster size: 5
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0787db7b551e1d03c(eu-north-1)

The job https://jenkins.scylladb.com/job/scylla-4.2/job/longevity/job/longevity-large-partition-200k-pks-4days-test/4/ failed with coredumps on 4 nodes.
RepairStreamingErr was running on node5: Scylla was stopped and several sstables were removed.

2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !NOTICE  | sudo:  centos : TTY=unknown ; PWD=/home/centos ; USER=root ; COMMAND=/bin/rm -f /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-161-big-CRC.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-161-big-Data.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-161-big-Digest.crc32 /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-161-big-Filter.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-161-big-Index.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-161-big-Scylla.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-161-big-Statistics.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-161-big-Summary.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-161-big-TOC.txt
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | sudo: pam_unix(sudo:session): session closed for user root
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !NOTICE  | sudo:  centos : TTY=unknown ; PWD=/home/centos ; USER=root ; COMMAND=/bin/sh -c find /var/lib/scylla/data/scylla_bench/test-* -maxdepth 1 -type f
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | sudo: pam_unix(sudo:session): session closed for user root
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !NOTICE  | sudo:  centos : TTY=unknown ; PWD=/home/centos ; USER=root ; COMMAND=/bin/rm -f /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-124-big-CRC.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-124-big-Data.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-124-big-Digest.crc32 /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-124-big-Filter.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-124-big-Index.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-124-big-Scylla.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-124-big-Statistics.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-124-big-Summary.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-124-big-TOC.txt
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | sudo: pam_unix(sudo:session): session closed for user root
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !NOTICE  | sudo:  centos : TTY=unknown ; PWD=/home/centos ; USER=root ; COMMAND=/bin/sh -c find /var/lib/scylla/data/scylla_bench/test-* -maxdepth 1 -type f
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | sudo: pam_unix(sudo:session): session closed for user root
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !NOTICE  | sudo:  centos : TTY=unknown ; PWD=/home/centos ; USER=root ; COMMAND=/bin/rm -f /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-164-big-CRC.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-164-big-Data.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-164-big-Digest.crc32 /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-164-big-Filter.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-164-big-Index.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-164-big-Scylla.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-164-big-Statistics.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-164-big-Summary.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-164-big-TOC.txt
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | sudo: pam_unix(sudo:session): session closed for user root
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !NOTICE  | sudo:  centos : TTY=unknown ; PWD=/home/centos ; USER=root ; COMMAND=/bin/sh -c find /var/lib/scylla/data/scylla_bench/test-* -maxdepth 1 -type f
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | sudo: pam_unix(sudo:session): session closed for user root
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !NOTICE  | sudo:  centos : TTY=unknown ; PWD=/home/centos ; USER=root ; COMMAND=/bin/rm -f /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-402-big-CRC.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-402-big-Data.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-402-big-Digest.crc32 /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-402-big-Filter.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-402-big-Index.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-402-big-Scylla.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-402-big-Statistics.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-402-big-Summary.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-402-big-TOC.txt
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | sudo: pam_unix(sudo:session): session closed for user root
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !NOTICE  | sudo:  centos : TTY=unknown ; PWD=/home/centos ; USER=root ; COMMAND=/bin/sh -c find /var/lib/scylla/data/scylla_bench/test-* -maxdepth 1 -type f
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | sudo: pam_unix(sudo:session): session closed for user root
2020-09-09T15:04:36+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !NOTICE  | sudo:  centos : TTY=unknown ; PWD=/home/centos ; USER=root ; COMMAND=/bin/rm -f /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-198-big-CRC.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-198-big-Data.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-198-big-Digest.crc32 /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-198-big-Filter.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-198-big-Index.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-198-big-Scylla.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-198-big-Statistics.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-198-big-Summary.db /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-198-big-TOC.txt

After that, Scylla was started, and once it was up, nodetool repair was started.

2020-09-09T15:06:30+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | scylla: [shard 10] repair - Repair 357 out of 798 ranges, id=3, shard=10, keyspace=scylla_bench, table={test, test_counters}, range=(-1306270692098423114, -1300644204403719293]
2020-09-09T15:06:30+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | scylla: [shard 10] repair - Repair 358 out of 798 ranges, id=3, shard=10, keyspace=scylla_bench, table={test, test_counters}, range=(-1300644204403719293, -1278315473952253769]

2020-09-09T15:06:52+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | scylla: [shard 0] gossip - InetAddress 10.0.3.148 is now DOWN, status = NORMAL
2020-09-09T15:06:52+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !WARNING | scylla: [shard 4] repair - repair id 3 on shard 4, keyspace=scylla_bench, cf=test, range=(-4266475509566451789, -4260169288585812631], got error in row level repair: seastar::rpc::closed_error (connection is closed)
2020-09-09T15:06:52+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !WARNING | scylla: [shard 7] repair - repair id 3 on shard 7, keyspace=scylla_bench, cf=test, range=(-6071420363549842715, -6053615954840553304], got error in row level repair: seastar::rpc::closed_error (connection is closed)
2020-09-09T15:06:52+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !WARNING | scylla: [shard 2] repair - repair id 3 on shard 2, keyspace=scylla_bench, cf=test, range=(-6074866704489542388, -6071420363549842715], got error in row level repair: seastar::rpc::closed_error (connection is closed)
2020-09-09T15:06:52+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | scylla: [shard 9] repair - Repair 214 out of 798 ranges, id=3, shard=9, keyspace=scylla_bench, table={test, test_counters}, range=(-4212704619385569635, -4200812148426214517]
2020-09-09T15:06:52+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !WARNING | scylla: [shard 0] repair - repair id 3 on shard 0, keyspace=scylla_bench, cf=test, range=(-inf, -9186517138740727452], got error in row level repair: seastar::rpc::closed_error (connection is closed)
2020-09-09T15:06:52+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | scylla: [shard 9] repair - Repair 215 out of 798 ranges, id=3, shard=9, keyspace=scylla_bench, table={test, test_counters}, range=(-4200812148426214517, -4198855502780856344]
2020-09-09T15:06:52+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !WARNING | scylla: [shard 7] repair - repair id 3 on shard 7, keyspace=scylla_bench, cf=test, range=(-5092412548355119091, -5078703936262561037], got error in row level repair: seastar::rpc::closed_error (connection is closed)
2020-09-09T15:06:52+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !WARNING | scylla: [shard 4] repair - repair id 3 on shard 4, keyspace=scylla_bench, cf=test, range=(-4247965060690677791, -4212704619385569635], got error in row level repair: seastar::rpc::closed_error (connection is closed)
2020-09-09T15:06:52+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | scylla: [shard 11] repair - Repair 277 out of 798 ranges, id=3, shard=11, keyspace=scylla_bench, table={test, test_counters}, range=(-3118054428779831217, -3112350764213830676]
2020-09-09T15:06:52+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !WARNING | scylla: [shard 6] repair - repair id 3 on shard 6, keyspace=scylla_bench, cf=test, range=(-4117610456030314987, -4090967356366011766], got error in row level repair: seastar::rpc::closed_error (connection is closed)
2020-09-09T15:06:52+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 !INFO    | scylla: [shard 0] repair - Repair 184 out of 798 ranges, id=3, shard=0, keyspace=scylla_bench, table={test, test_counters}, range=(-4896374765420338731, -4878959984160755275]

At around this time, an error and a coredump occurred on the other 4 nodes:
node4:

2020-09-09T15:06:29+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 !ERR     | scylla: [shard 4] flat_mutation_reader - [validator 0x604034d5ff00 for sstable writer /var/lib/scylla/data/scylla_bench/test-a3eadd80f2a711ea853c000000000004/mc-400-big-Data.db (scylla_bench.test a3eadd80-f2a7-11ea-853c-000000000004)] Unexpected mutation fragment: previous clustering row:{position: clustered,ckp{0008000000000000ec33},0}, current range tombstone:{position: clustered,ckp{00080000000000010469},-1}, at:    0x333167d#012   0x3331990#012   0x3331e19#012   0x2e42ccc#012   0x117e5c3#012   0x119a9f6#012   0x12fa3c8#012   0x1243e7f#012   0x1244873#012   0x313f25c#012   --------#012   N7seastar12continuationINS_8internal22promise_base_with_typeIJEEEZNS_5asyncIZN8sstables7sstable16write_componentsE20flat_mutation_readermNS_13lw_shared_ptrIK6schemaEERKNS5_21sstable_writer_configE14encoding_statsRKNS_17io_priority_classEEUlvE_JEEENS_8futurizeINSt9result_ofIFNSt5decayIT_E4typeEDpNSM_IT0_E4typeEEE4typeEE4typeENS_17thread_attributesEOSN_DpOSQ_EUlvE0_ZZNS_6futureIJEE14then_impl_nrvoIS13_S15_EET0_S10_ENKUlvE_clEvEUlRS3_RS13_ONS_12future_stateIJEEEE_JEEE#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<>, seastar::future<>::finally_body<seastar::async<sstables::sstable::write_components(flat_mutation_reader, unsigned long, seastar::lw_shared_ptr<schema const>, sstables::sstable_writer_config const&, encoding_stats, seastar::io_priority_class const&)::{lambda()#1}>(seastar::thread_attributes, std::decay&&, (std::decay<sstables::sstable::write_components(flat_mutation_reader, unsigned long, seastar::lw_shared_ptr<schema const>, sstables::sstable_writer_config const&, encoding_stats, seastar::io_priority_class const&)::{lambda()#1}>::type&&)...)::{lambda()#3}, false>, seastar::future<>::then_wrapped_nrvo<seastar::future<>, {lambda()#3}>({lambda()#3}&&)::{lambda()#1}::operator()() const::{lambda(seastar::internal::promise_base_with_type<>&, 
{lambda()#3}&, seastar::future_state<>&&)#1}>#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<>, seastar::future<>::finally_body<sstables::sstable::write_components(flat_mutation_reader, unsigned long, seastar::lw_shared_ptr<schema const>, sstables::sstable_writer_config const&, encoding_stats, seastar::io_priority_class const&)::{lambda()#2}, false>, seastar::future<>::then_wrapped_nrvo<seastar::future<>, sstables::sstable::write_components(flat_mutation_reader, unsigned long, seastar::lw_shared_ptr<schema const>, sstables::sstable_writer_config const&, encoding_stats, seastar::io_priority_class const&)::{lambda()#2}>(sstables::sstable::write_components(flat_mutation_reader, unsigned long, seastar::lw_shared_ptr<schema const>, sstables::sstable_writer_config const&, encoding_stats, seastar::io_priority_class const&)::{lambda()#2}&&)::{lambda()#1}::operator()() const::{lambda(seastar::internal::promise_base_with_type<>&, sstables::sstable::write_components(flat_mutation_reader, unsigned long, seastar::lw_shared_ptr<schema const>, sstables::sstable_writer_config const&, encoding_stats, seastar::io_priority_class const&)::{lambda()#2}&, seastar::future_state<>&&)#1}>#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<>, repair_writer::create_writer(seastar::sharded<database>&, unsigned int)::{lambda(flat_mutation_reader)#1}::operator()(flat_mutation_reader) const::{lambda(bool)#1}::operator()(bool)::{lambda(flat_mutation_reader)#1}::operator()(flat_mutation_reader) const::{lambda()#1}, seastar::future<>::then_impl_nrvo<{lambda(bool)#1}, {lambda(flat_mutation_reader)#1}>({lambda(bool)#1}&&)::{lambda()#1}::operator()() const::{lambda(seastar::internal::promise_base_with_type<>&, {lambda(bool)#1}&, seastar::future_state<>&&)#1}>#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<>, repair_writer::create_writer(seastar::sharded<database>&, unsigned 
int)::{lambda(flat_mutation_reader)#1}::operator()(flat_mutation_reader) const::{lambda(bool)#1}::operator()(bool)::{lambda(flat_mutation_reader)#1}::operator()(flat_mutation_reader) const::{lambda()#2}, seastar::future<>::then_impl_nrvo<{lambda(bool)#1}, {lambda(flat_mutation_reader)#1}>({lambda(bool)#1}&&)::{lambda()#1}::operator()() const::{lambda(seastar::internal::promise_base_with_type<>&, {lambda(bool)#1}&, seastar::future_state<>&&)#1}>#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<>, repair_writer::create_writer(seastar::sharded<database>&, unsigned int)::{lambda(flat_mutation_reader)#1}::operator()(flat_mutation_reader) const::{lambda(bool)#1}::operator()(bool)::{lambda(flat_mutation_reader)#1}::operator()(flat_mutation_reader) const::{lambda()#3}, seastar::future<>::then_impl_nrvo<{lambda(bool)#1}, {lambda(flat_mutation_reader)#1}>({lambda(bool)#1}&&)::{lambda()#1}::operator()() const::{lambda(seastar::internal::promise_base_with_type<>&, {lambda(bool)#1}&, seastar::future_state<>&&)#1}>#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<>, seastar::future<>::finally_body<seastar::smp::submit_to<mutation_writer::multishard_writer::consume(unsigned int)::{lambda()#1}>(unsigned int, seastar::smp_submit_to_options, std::result_of&&)::{lambda()#1}, false>, seastar::future<>::then_wrapped_nrvo<seastar::future<>, {lambda()#1}>({lambda()#1}&&)::{lambda()#1}::operator()() const::{lambda(seastar::internal::promise_base_with_type<>&, {lambda()#1}&, seastar::future_state<>&&)#1}>#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<>, seastar::future<>::handle_exception<mutation_writer::multishard_writer::consume(unsigned int)::{lambda(std::__exception_ptr::exception_ptr)#2}>(mutation_writer::multishard_writer::consume(unsigned int)::{lambda(std::__exception_ptr::exception_ptr)#2}&&)::{lambda(auto:1)#1}, seastar::future<>::then_wrapped_nrvo<seastar::future<>, 
mutation_writer::multishard_writer::consume(unsigned int)::{lambda(std::__exception_ptr::exception_ptr)#2}&&>(seastar::future<>&&)::{lambda()#1}::operator()() const::{lambda(seastar::internal::promise_base_with_type<>&, auto:1&, seastar::future_state<>&&)#1}>
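
The validator message above arrives as a single syslog record in which rsyslog has replaced each newline with `#012`. A minimal Python sketch for making such a record readable follows. The function names are hypothetical, and the regular expression assumes the message shape seen in this report, which is not a stable format.

```python
import re

# rsyslog writes control characters as '#' plus three octal digits;
# '#012' is a newline. Only values below 32 are expanded, so text such
# as '{lambda()#1}' is left alone.
_ESCAPE_RE = re.compile(r"#(0[0-3][0-7])")

# Shape of the validator message as it appears in this report
# (an assumption, not a documented format).
_VALIDATION_RE = re.compile(
    r"Unexpected mutation fragment: "
    r"previous (?P<prev_kind>[a-z ]+):\{position: (?P<prev_pos>.*?)\}, "
    r"current (?P<cur_kind>[a-z ]+):\{position: (?P<cur_pos>.*?)\}, "
    r"at:(?P<rest>.*)",
    re.DOTALL,
)


def expand_syslog_escapes(line):
    """Turn '#ooo' octal escapes back into the characters they stand for."""
    return _ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 8)), line)


def parse_validation_error(line):
    """Return fragment kinds, positions and the first address block, or None."""
    text = expand_syslog_escapes(line)
    m = _VALIDATION_RE.search(text)
    if m is None:
        return None
    # Addresses up to the first '--------' separator form the first block.
    first_block = m.group("rest").split("--------", 1)[0]
    return {
        "previous": (m.group("prev_kind"), m.group("prev_pos")),
        "current": (m.group("cur_kind"), m.group("cur_pos")),
        "addresses": re.findall(r"0x[0-9a-f]+", first_block),
    }
```

Calling `parse_validation_error` on the record above should give the two fragment kinds with their positions and the addresses that precede the first `--------` separator. Those addresses can then be passed to `addr2line`.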

2020-09-09T15:06:29+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 !INFO    | scylla: Aborting on shard 4.
2020-09-09T15:06:29+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 !INFO    | scylla: Backtrace:
2020-09-09T15:06:29+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 !INFO    | scylla: 0x0000000002ed6582
2020-09-09T15:06:29+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 !INFO    | scylla: 0x0000000002e7ab00
2020-09-09T15:06:29+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 !INFO    | scylla: 0x0000000002e7ada5
2020-09-09T15:06:29+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 !INFO    | scylla: 0x0000000002e7adf0
2020-09-09T15:06:29+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 !INFO    | scylla: 0x00007fc536773a8f
2020-09-09T15:06:29+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 !INFO    | scylla: /opt/scylladb/libreloc/libc.so.6+0x000000000003c9e4
2020-09-09T15:06:29+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 !INFO    | scylla: /opt/scylladb/libreloc/libc.so.6+0x0000000000025894
2020-09-09T15:06:29+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 !INFO    | scylla: 0x0000000002e42cf2
2020-09-09T15:06:29+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 !INFO    | scylla: 0x000000000117e5c3
2020-09-09T15:06:29+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 !INFO    | scylla: 0x000000000119a9f6
2020-09-09T15:06:29+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 !INFO    | scylla: 0x00000000012fa3c8
2020-09-09T15:06:29+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 !INFO    | scylla: 0x0000000001243e7f
2020-09-09T15:06:29+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 !INFO    | scylla: 0x0000000001244873
2020-09-09T15:06:29+00:00  longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 !INFO    | scylla: 0x000000000313f25c

Decoded backtrace:

[centos@longevity-large-partitions-200k-pks-db-node-f93d5ecf-3 ~]$ addr2line -Cpife /usr/lib/debug/opt/scylladb/libexec/scylla-4.2.rc4-0.20200907.bf0c493c28.x86_64.debug 0x0000000002ed6582 0x0000000002e7ab00 0x0000000002e7ada5 0x0000000002e7adf0 0x00007fc536773a8f /opt/scylladb/libreloc/libc.so.6+0x000000000003c9e4 /opt/scylladb/libreloc/libc.so.6+0x0000000000025894 0x0000000002e42cf2 0x000000000117e5c3 0x000000000119a9f6 0x00000000012fa3c8 0x0000000001243e7f 0x0000000001244873 0x000000000313f25c 
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at /usr/include/fmt/format.h:2188
seastar::backtrace_buffer::append_backtrace() at /usr/include/fmt/format.h:2188
 (inlined by) print_with_backtrace at /jenkins/workspace/scylla-4.2/build/scylla/seastar/src/core/reactor.cc:751
seastar::print_with_backtrace(char const*) at /usr/include/fmt/format.h:2188
sigabrt_action at /usr/include/fmt/format.h:2188
 (inlined by) operator() at /jenkins/workspace/scylla-4.2/build/scylla/seastar/src/core/reactor.cc:3451
 (inlined by) _FUN at /jenkins/workspace/scylla-4.2/build/scylla/seastar/src/core/reactor.cc:3447
?? ??:0
?? ??:0
?? ??:0
seastar::on_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char> >) at /jenkins/workspace/scylla-4.2/build/scylla/seastar/src/core/on_internal_error.cc:39 (discriminator 2)
(anonymous namespace)::on_validation_error(seastar::logger&, seastar::basic_sstring<char, unsigned int, 15u, true> const&) [clone .constprop.0] at /usr/include/fmt/format.h:2188
mutation_fragment_stream_validating_filter::operator()(mutation_fragment const&) at /usr/include/fmt/format.h:2188
void flat_mutation_reader::impl::consume_pausable_in_thread<std::reference_wrapper<flat_mutation_reader::impl::consumer_adapter<sstables::sstable_writer> >, mutation_fragment_stream_validating_filter>(std::reference_wrapper<flat_mutation_reader::impl::consumer_adapter<sstables::sstable_writer> >, mutation_fragment_stream_validating_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at /jenkins/workspace/scylla-4.2/build/scylla/./flat_mutation_reader.hh:188
auto flat_mutation_reader::impl::consume_in_thread<sstables::sstable_writer, mutation_fragment_stream_validating_filter>(sstables::sstable_writer, mutation_fragment_stream_validating_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at /usr/include/fmt/format.h:2188
 (inlined by) auto flat_mutation_reader::consume_in_thread<sstables::sstable_writer, mutation_fragment_stream_validating_filter>(sstables::sstable_writer, mutation_fragment_stream_validating_filter, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >) at /jenkins/workspace/scylla-4.2/build/scylla/./flat_mutation_reader.hh:382
 (inlined by) operator() at /jenkins/workspace/scylla-4.2/build/scylla/sstables/sstables.cc:2465
__invoke_impl<void, sstables::sstable::write_components(flat_mutation_reader, uint64_t, sstables::schema_ptr, const sstables::sstable_writer_config&, encoding_stats, const seastar::io_priority_class&)::<lambda()> > at /usr/include/fmt/format.h:2188
 (inlined by) __invoke<sstables::sstable::write_components(flat_mutation_reader, uint64_t, sstables::schema_ptr, const sstables::sstable_writer_config&, encoding_stats, const seastar::io_priority_class&)::<lambda()> > at /usr/include/c++/10/bits/invoke.h:95
 (inlined by) __apply_impl<sstables::sstable::write_components(flat_mutation_reader, uint64_t, sstables::schema_ptr, const sstables::sstable_writer_config&, encoding_stats, const seastar::io_priority_class&)::<lambda()>, std::tuple<> > at /usr/include/c++/10/tuple:1723
 (inlined by) apply<sstables::sstable::write_components(flat_mutation_reader, uint64_t, sstables::schema_ptr, const sstables::sstable_writer_config&, encoding_stats, const seastar::io_priority_class&)::<lambda()>, std::tuple<> > at /usr/include/c++/10/tuple:1734
 (inlined by) apply<sstables::sstable::write_components(flat_mutation_reader, uint64_t, sstables::schema_ptr, const sstables::sstable_writer_config&, encoding_stats, const seastar::io_priority_class&)::<lambda()> > at /jenkins/workspace/scylla-4.2/build/scylla/seastar/include/seastar/core/future.hh:1976
 (inlined by) operator() at /jenkins/workspace/scylla-4.2/build/scylla/seastar/include/seastar/core/thread.hh:259
 (inlined by) call at /jenkins/workspace/scylla-4.2/build/scylla/seastar/include/seastar/util/noncopyable_function.hh:101
seastar::noncopyable_function<void ()>::operator()() const at /jenkins/workspace/scylla-4.2/build/scylla/seastar/include/seastar/util/noncopyable_function.hh:184
 (inlined by) seastar::thread_context::main() at /jenkins/workspace/scylla-4.2/build/scylla/seastar/src/core/thread.cc:297
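
The address list passed to `addr2line` above was copied from the `Backtrace:` syslog lines. A minimal Python sketch that collects those frames and builds the same invocation is shown below. The helper names are hypothetical, and it assumes every frame line ends with either a bare `0x…` address or a `library+0x…` token, as in this report.

```python
import re

# A frame line ends with 'scylla: <addr>' where <addr> is either a bare
# hex address or 'path/to/lib.so+0x...'. Other 'scylla:' lines
# ('Backtrace:', 'Aborting on shard 4.') do not match.
_FRAME_RE = re.compile(r"scylla: ((?:\S+\+)?0x[0-9a-fA-F]+)\s*$")


def backtrace_frames(lines):
    """Extract frame tokens, in order, from syslog backtrace lines."""
    frames = []
    for line in lines:
        m = _FRAME_RE.search(line)
        if m:
            frames.append(m.group(1))
    return frames


def addr2line_command(debug_binary, frames):
    """Build the argv list for addr2line with the flags used in this report."""
    return ["addr2line", "-Cpife", debug_binary, *frames]
```

The resulting list can be handed to `subprocess.run` on a host that has the matching debug binary installed. Library-relative tokens are passed through unchanged, as in the command above, so they typically resolve to `?? ??:0`.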

The same error and coredump occurred on node1, node2, and node3.

2020-09-09 15:06:33.000: (CoreDumpEvent Severity.ERROR): node=Node longevity-large-partitions-200k-pks-db-node-f93d5ecf-3 [13.49.78.64 | 10.0.2.115] (seed: False)
corefile_url=
https://storage.cloud.google.com/upload.scylladb.com/core.scylla.996.0d6df37b9fca4e6ca646ad6fae051760.3715.1599663993000000/core.scylla.996.0d6df37b9fca4e6ca646ad6fae051760.3715.1599663993000000.gz
backtrace=           PID: 3715 (scylla)
UID: 996 (scylla)
GID: 1001 (scylla)
Signal: 6 (ABRT)
Timestamp: Wed 2020-09-09 15:06:33 UTC (1min 36s ago)
Command Line: /usr/bin/scylla --blocked-reactor-notify-ms 500 --abort-on-lsa-bad-alloc 1 --abort-on-seastar-bad-alloc --abort-on-internal-error 1 --abort-on-ebadf 1 --enable-sstable-key-validation 1 --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --io-properties-file=/etc/scylla.d/io_properties.yaml --cpuset 0-11 --lock-memory=1
Executable: /opt/scylladb/libexec/scylla
Control Group: /
Boot ID: 0d6df37b9fca4e6ca646ad6fae051760
Machine ID: df877a200226bc47d06f26dae0736ec9
Hostname: longevity-large-partitions-200k-pks-db-node-f93d5ecf-3
Coredump: /var/lib/systemd/coredump/core.scylla.996.0d6df37b9fca4e6ca646ad6fae051760.3715.1599663993000000
Message: Process 3715 (scylla) of user 996 dumped core.
Stack trace of thread 3739:
#0  0x00007f2a959519e5 raise (libc.so.6)
#1  0x00007f2a9593a94d abort (libc.so.6)
#2  0x0000000002e42cf3 _ZN7seastar17on_internal_errorERNS_6loggerESt17basic_string_viewIcSt11char_traitsIcEE (scylla)
#3  0x000000000117e5c4 on_validation_error (scylla)
#4  0x000000000119a9f7 _ZN42mutation_fragment_stream_validating_filterclERK17mutation_fragment (scylla)
#5  0x00000000012fa3c9 _ZN20flat_mutation_reader4impl26consume_pausable_in_threadISt17reference_wrapperINS0_16consumer_adapterIN8sstables14sstable_writerEEEE42mutation_fragment_stream_validating_filterEEvT_T0_NSt6chrono10time_pointIN7seastar12lowres_clockENSB_8durationIlSt5ratioILl1ELl1000EEEEEE (scylla)
#6  0x0000000001243e80 _ZZN8sstables7sstable16write_componentsE20flat_mutation_readermN7seastar13lw_shared_ptrIK6schemaEERKNS_21sstable_writer_configE14encoding_statsRKNS2_17io_priority_classEENUlvE_clEv (scylla)
#7  0x0000000001244874 __invoke_impl<void, sstables::sstable::write_components(flat_mutation_reader, uint64_t, sstables::schema_ptr, const sstables::sstable_writer_config&, encoding_stats, const seastar::io_priority_class&)::<lambda()> > (scylla)
#8  0x000000000313f25d _ZNK7seastar20noncopyable_functionIFvvEEclEv (scylla)
Stack trace of thread 3753:
#0  0x00007f2a9639d9ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007f2a96393432 start_thread (libpthread.so.0)
#5  0x00007f2a95a16913 __clone (libc.so.6)
Stack trace of thread 3749:
#0  0x00007f2a9639d9ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007f2a96393432 start_thread (libpthread.so.0)
#5  0x00007f2a95a16913 __clone (libc.so.6)
Stack trace of thread 3746:
#0  0x00007f2a9639d9ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007f2a96393432 start_thread (libpthread.so.0)
#5  0x00007f2a95a16913 __clone (libc.so.6)
Stack trace of thread 3748:
#0  0x00007f2a9639d9ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007f2a96393432 start_thread (libpthread.so.0)
#5  0x00007f2a95a16913 __clone (libc.so.6)
Stack trace of thread 3743:
#0  0x00007f2a9639d9ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007f2a96393432 start_thread (libpthread.so.0)
#5  0x00007f2a95a16913 __clone (libc.so.6)
Stack trace of thread 3742:
#0  0x00007f2a9639d9ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007f2a96393432 start_thread (libpthread.so.0)
#5  0x00007f2a95a16913 __clone (libc.so.6)
Stack trace of thread 3740:
#0  0x00007f2a95a1137d syscall (libc.so.6)
#1  0x000000000310da9d _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla)
#2  0x0000000003108ff4 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla)
#3  0x00000000031091b5 _ZN7seastar19reactor_backend_aio23wait_and_process_eventsEPK10__sigset_t (scylla)
#4  0x0000000002e785c0 _ZN7seastar7reactor5sleepEv (scylla)
#5  0x0000000002eaf82e _ZN7seastar7reactor3runEv (scylla)
#6  0x0000000002ebecfb _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS
download_instructions=
gsutil cp gs://upload.scylladb.com/core.scylla.996.0d6df37b9fca4e6ca646ad6fae051760.3715.1599663993000000/core.scylla.996.0d6df37b9fca4e6ca646ad6fae051760.3715.1599663993000000.gz .
gunzip /var/lib/systemd/coredump/core.scylla.996.0d6df37b9fca4e6ca646ad6fae051760.3715.1599663993000000.gz
2020-09-09 15:06:29.000: (CoreDumpEvent Severity.ERROR): node=Node longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 [13.49.80.133 | 10.0.3.192] (seed: False)
corefile_url=
https://storage.cloud.google.com/upload.scylladb.com/core.scylla.996.726b0a32f57d44b69419d9271fb74b06.4183.1599663989000000/core.scylla.996.726b0a32f57d44b69419d9271fb74b06.4183.1599663989000000.gz
backtrace=           PID: 4183 (scylla)
UID: 996 (scylla)
GID: 1001 (scylla)
Signal: 6 (ABRT)
Timestamp: Wed 2020-09-09 15:06:29 UTC (3min 5s ago)
Command Line: /usr/bin/scylla --blocked-reactor-notify-ms 500 --abort-on-lsa-bad-alloc 1 --abort-on-seastar-bad-alloc --abort-on-internal-error 1 --abort-on-ebadf 1 --enable-sstable-key-validation 1 --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --io-properties-file=/etc/scylla.d/io_properties.yaml --cpuset 0-11 --lock-memory=1
Executable: /opt/scylladb/libexec/scylla
Control Group: /
Boot ID: 726b0a32f57d44b69419d9271fb74b06
Machine ID: df877a200226bc47d06f26dae0736ec9
Hostname: longevity-large-partitions-200k-pks-db-node-f93d5ecf-4
Coredump: /var/lib/systemd/coredump/core.scylla.996.726b0a32f57d44b69419d9271fb74b06.4183.1599663989000000
Message: Process 4183 (scylla) of user 996 dumped core.
Stack trace of thread 4187:
#0  0x00007fc535d269e5 raise (libc.so.6)
#1  0x00007fc535d0f94d abort (libc.so.6)
#2  0x0000000002e42cf3 _ZN7seastar17on_internal_errorERNS_6loggerESt17basic_string_viewIcSt11char_traitsIcEE (scylla)
#3  0x000000000117e5c4 on_validation_error (scylla)
#4  0x000000000119a9f7 _ZN42mutation_fragment_stream_validating_filterclERK17mutation_fragment (scylla)
#5  0x00000000012fa3c9 _ZN20flat_mutation_reader4impl26consume_pausable_in_threadISt17reference_wrapperINS0_16consumer_adapterIN8sstables14sstable_writerEEEE42mutation_fragment_stream_validating_filterEEvT_T0_NSt6chrono10time_pointIN7seastar12lowres_clockENSB_8durationIlSt5ratioILl1ELl1000EEEEEE (scylla)
#6  0x0000000001243e80 _ZZN8sstables7sstable16write_componentsE20flat_mutation_readermN7seastar13lw_shared_ptrIK6schemaEERKNS_21sstable_writer_configE14encoding_statsRKNS2_17io_priority_classEENUlvE_clEv (scylla)
#7  0x0000000001244874 __invoke_impl<void, sstables::sstable::write_components(flat_mutation_reader, uint64_t, sstables::schema_ptr, const sstables::sstable_writer_config&, encoding_stats, const seastar::io_priority_class&)::<lambda()> > (scylla)
#8  0x000000000313f25d _ZNK7seastar20noncopyable_functionIFvvEEclEv (scylla)
Stack trace of thread 4203:
#0  0x00007fc5367729ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007fc536768432 start_thread (libpthread.so.0)
#5  0x00007fc535deb913 __clone (libc.so.6)
Stack trace of thread 4200:
#0  0x00007fc5367729ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007fc536768432 start_thread (libpthread.so.0)
#5  0x00007fc535deb913 __clone (libc.so.6)
Stack trace of thread 4201:
#0  0x00007fc5367729ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007fc536768432 start_thread (libpthread.so.0)
#5  0x00007fc535deb913 __clone (libc.so.6)
Stack trace of thread 4195:
#0  0x00007fc5367729ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007fc536768432 start_thread (libpthread.so.0)
#5  0x00007fc535deb913 __clone (libc.so.6)
Stack trace of thread 4188:
#0  0x00007fc535de637d syscall (libc.so.6)
#1  0x000000000310da9d _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla)
#2  0x0000000003108ff4 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla)
#3  0x00000000031091b5 _ZN7seastar19reactor_backend_aio23wait_and_process_eventsEPK10__sigset_t (scylla)
#4  0x0000000002e785c0 _ZN7seastar7reactor5sleepEv (scylla)
#5  0x0000000002eaf82e _ZN7seastar7reactor3runEv (scylla)
#6  0x0000000002ebecfb _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS_14reactor_configEENKUlvE1_clEv (scylla)
#7  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#8  0x00007fc536768432 start_thread (libpthread.so.0)
#9  0x00007fc535deb913 __clone (libc.so.6)
Stack trace of thread 4198:
#0  0x00007fc5367729ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007fc536768432 start_thread (libpthread.so.0)
#5  0x00007fc535deb913 __clone (libc.so.6)
Stack trace of thread 4199:
#0  0x00007fc5367729ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2
download_instructions=
gsutil cp gs://upload.scylladb.com/core.scylla.996.726b0a32f57d44b69419d9271fb74b06.4183.1599663989000000/core.scylla.996.726b0a32f57d44b69419d9271fb74b06.4183.1599663989000000.gz .
gunzip /var/lib/systemd/coredump/core.scylla.996.726b0a32f57d44b69419d9271fb74b06.4183.1599663989000000.gz



2020-09-09 15:06:33.000: (CoreDumpEvent Severity.ERROR): node=Node longevity-large-partitions-200k-pks-db-node-f93d5ecf-2 [13.49.73.139 | 10.0.2.80] (seed: False)
corefile_url=
https://storage.cloud.google.com/upload.scylladb.com/core.scylla.996.b74c2e9da27347799a357c6dbd559733.3103.1599663993000000/core.scylla.996.b74c2e9da27347799a357c6dbd559733.3103.1599663993000000.gz
backtrace=           PID: 3103 (scylla)
UID: 996 (scylla)
GID: 1001 (scylla)
Signal: 6 (ABRT)
Timestamp: Wed 2020-09-09 15:06:33 UTC (3min 37s ago)
Command Line: /usr/bin/scylla --blocked-reactor-notify-ms 500 --abort-on-lsa-bad-alloc 1 --abort-on-seastar-bad-alloc --abort-on-internal-error 1 --abort-on-ebadf 1 --enable-sstable-key-validation 1 --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --io-properties-file=/etc/scylla.d/io_properties.yaml --cpuset 0-11 --lock-memory=1
Executable: /opt/scylladb/libexec/scylla
Control Group: /
Boot ID: b74c2e9da27347799a357c6dbd559733
Machine ID: df877a200226bc47d06f26dae0736ec9
Hostname: longevity-large-partitions-200k-pks-db-node-f93d5ecf-2
Coredump: /var/lib/systemd/coredump/core.scylla.996.b74c2e9da27347799a357c6dbd559733.3103.1599663993000000
Message: Process 3103 (scylla) of user 996 dumped core.
Stack trace of thread 3113:
#0  0x00007f707790c9e5 raise (libc.so.6)
#1  0x00007f70778f594d abort (libc.so.6)
#2  0x0000000002e42cf3 _ZN7seastar17on_internal_errorERNS_6loggerESt17basic_string_viewIcSt11char_traitsIcEE (scylla)
#3  0x000000000117e5c4 on_validation_error (scylla)
#4  0x000000000119a9f7 _ZN42mutation_fragment_stream_validating_filterclERK17mutation_fragment (scylla)
#5  0x00000000012fa3c9 _ZN20flat_mutation_reader4impl26consume_pausable_in_threadISt17reference_wrapperINS0_16consumer_adapterIN8sstables14sstable_writerEEEE42mutation_fragment_stream_validating_filterEEvT_T0_NSt6chrono10time_pointIN7seastar12lowres_clockENSB_8durationIlSt5ratioILl1ELl1000EEEEEE (scylla)
#6  0x0000000001243e80 _ZZN8sstables7sstable16write_componentsE20flat_mutation_readermN7seastar13lw_shared_ptrIK6schemaEERKNS_21sstable_writer_configE14encoding_statsRKNS2_17io_priority_classEENUlvE_clEv (scylla)
#7  0x0000000001244874 __invoke_impl<void, sstables::sstable::write_components(flat_mutation_reader, uint64_t, sstables::schema_ptr, const sstables::sstable_writer_config&, encoding_stats, const seastar::io_priority_class&)::<lambda()> > (scylla)
#8  0x000000000313f25d _ZNK7seastar20noncopyable_functionIFvvEEclEv (scylla)
Stack trace of thread 3121:
#0  0x00007f70783589ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007f707834e432 start_thread (libpthread.so.0)
#5  0x00007f70779d1913 __clone (libc.so.6)
Stack trace of thread 3123:
#0  0x00007f70783589ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007f707834e432 start_thread (libpthread.so.0)
#5  0x00007f70779d1913 __clone (libc.so.6)
Stack trace of thread 3118:
#0  0x00007f70783589ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007f707834e432 start_thread (libpthread.so.0)
#5  0x00007f70779d1913 __clone (libc.so.6)
Stack trace of thread 3114:
#0  0x00007f70779cc37d syscall (libc.so.6)
#1  0x000000000310da9d _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla)
#2  0x0000000003108ff4 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla)
#3  0x00000000031091b5 _ZN7seastar19reactor_backend_aio23wait_and_process_eventsEPK10__sigset_t (scylla)
#4  0x0000000002e785c0 _ZN7seastar7reactor5sleepEv (scylla)
#5  0x0000000002eaf82e _ZN7seastar7reactor3runEv (scylla)
#6  0x0000000002ebecfb _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS_14reactor_configEENKUlvE1_clEv (scylla)
#7  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#8  0x00007f707834e432 start_thread (libpthread.so.0)
#9  0x00007f70779d1913 __clone (libc.so.6)
Stack trace of thread 3115:
#0  0x00007f70783589ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007f707834e432 start_thread (libpthread.so.0)
#5  0x00007f70779d1913 __clone (libc.so.6)
Stack trace of thread 3108:
#0  0x00007f70779cc37d syscall (libc.so.6)
#1  0x000000000310da9d _ZN7seastar8internal13io_pgeteventsEmllPNS0_9linux_abi8io_eventEPK8timespecPK10__sigset_tb (scylla)
#2  0x0000000003108ff4 _ZN7seastar19reactor_backend_aio12await_eventsEiPK10__sigset_t (scylla)
#3  0x00000000031091b5 _ZN7seastar19reactor_backend_aio23wait_and_process_eventsEPK10__sigset_t (scylla)
#4  0x0000000002e785c0 _ZN7seastar7reactor5sleepEv (scylla)
#5  0x0000000002eaf82e _ZN7seastar7reactor3runEv (scylla)
#6  0x0000000002ebecfb _ZZN
download_instructions=
gsutil cp gs://upload.scylladb.com/core.scylla.996.b74c2e9da27347799a357c6dbd559733.3103.1599663993000000/core.scylla.996.b74c2e9da27347799a357c6dbd559733.3103.1599663993000000.gz .
gunzip /var/lib/systemd/coredump/core.scylla.996.b74c2e9da27347799a357c6dbd559733.3103.1599663993000000.gz



2020-09-09 15:06:29.000: (CoreDumpEvent Severity.ERROR): node=Node longevity-large-partitions-200k-pks-db-node-f93d5ecf-1 [13.53.126.243 | 10.0.3.148] (seed: True)
corefile_url=
https://storage.cloud.google.com/upload.scylladb.com/core.scylla.996.cb537eccb6af44e4b12b4442c180ee91.2868.1599663989000000/core.scylla.996.cb537eccb6af44e4b12b4442c180ee91.2868.1599663989000000.gz
backtrace=           PID: 2868 (scylla)
UID: 996 (scylla)
GID: 1001 (scylla)
Signal: 6 (ABRT)
Timestamp: Wed 2020-09-09 15:06:29 UTC (4min 19s ago)
Command Line: /usr/bin/scylla --blocked-reactor-notify-ms 500 --abort-on-lsa-bad-alloc 1 --abort-on-seastar-bad-alloc --abort-on-internal-error 1 --abort-on-ebadf 1 --enable-sstable-key-validation 1 --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --io-properties-file=/etc/scylla.d/io_properties.yaml --cpuset 0-11 --lock-memory=1
Executable: /opt/scylladb/libexec/scylla
Control Group: /
Boot ID: cb537eccb6af44e4b12b4442c180ee91
Machine ID: df877a200226bc47d06f26dae0736ec9
Hostname: longevity-large-partitions-200k-pks-db-node-f93d5ecf-1
Coredump: /var/lib/systemd/coredump/core.scylla.996.cb537eccb6af44e4b12b4442c180ee91.2868.1599663989000000
Message: Process 2868 (scylla) of user 996 dumped core.
Stack trace of thread 2872:
#0  0x00007fbcb9f679e5 raise (libc.so.6)
#1  0x00007fbcb9f5094d abort (libc.so.6)
#2  0x0000000002e42cf3 _ZN7seastar17on_internal_errorERNS_6loggerESt17basic_string_viewIcSt11char_traitsIcEE (scylla)
#3  0x000000000117e5c4 on_validation_error (scylla)
#4  0x000000000119a9f7 _ZN42mutation_fragment_stream_validating_filterclERK17mutation_fragment (scylla)
#5  0x00000000012fa3c9 _ZN20flat_mutation_reader4impl26consume_pausable_in_threadISt17reference_wrapperINS0_16consumer_adapterIN8sstables14sstable_writerEEEE42mutation_fragment_stream_validating_filterEEvT_T0_NSt6chrono10time_pointIN7seastar12lowres_clockENSB_8durationIlSt5ratioILl1ELl1000EEEEEE (scylla)
#6  0x0000000001243e80 _ZZN8sstables7sstable16write_componentsE20flat_mutation_readermN7seastar13lw_shared_ptrIK6schemaEERKNS_21sstable_writer_configE14encoding_statsRKNS2_17io_priority_classEENUlvE_clEv (scylla)
#7  0x0000000001244874 __invoke_impl<void, sstables::sstable::write_components(flat_mutation_reader, uint64_t, sstables::schema_ptr, const sstables::sstable_writer_config&, encoding_stats, const seastar::io_priority_class&)::<lambda()> > (scylla)
#8  0x000000000313f25d _ZNK7seastar20noncopyable_functionIFvvEEclEv (scylla)
Stack trace of thread 2883:
#0  0x00007fbcba9b39ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007fbcba9a9432 start_thread (libpthread.so.0)
#5  0x00007fbcba02c913 __clone (libc.so.6)
Stack trace of thread 2885:
#0  0x00007fbcba9b39ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007fbcba9a9432 start_thread (libpthread.so.0)
#5  0x00007fbcba02c913 __clone (libc.so.6)
Stack trace of thread 2886:
#0  0x00007fbcba9b39ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007fbcba9a9432 start_thread (libpthread.so.0)
#5  0x00007fbcba02c913 __clone (libc.so.6)
Stack trace of thread 2889:
#0  0x00007fbcba9b39ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007fbcba9a9432 start_thread (libpthread.so.0)
#5  0x00007fbcba02c913 __clone (libc.so.6)
Stack trace of thread 2891:
#0  0x00007fbcba9b39ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007fbcba9a9432 start_thread (libpthread.so.0)
#5  0x00007fbcba02c913 __clone (libc.so.6)
Stack trace of thread 2882:
#0  0x00007fbcba9b39ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007fbcba9a9432 start_thread (libpthread.so.0)
#5  0x00007fbcba02c913 __clone (libc.so.6)
Stack trace of thread 2884:
#0  0x00007fbcba9b39ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _ZNKSt8functionIFvvEEclEv (scylla)
#4  0x00007fbcba9a9432 start_thread (libpthread.so.0)
#5  0x00007fbcba02c913 __clone (libc.so.6)
Stack trace of thread 2881:
#0  0x00007fbcba9b39ac read (libpthread.so.0)
#1  0x0000000003106b97 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x0000000003106df8 operator() (scylla)
#3  0x0000000002e42f7e _
download_instructions=
gsutil cp gs://upload.scylladb.com/core.scylla.996.cb537eccb6af44e4b12b4442c180ee91.2868.1599663989000000/core.scylla.996.cb537eccb6af44e4b12b4442c180ee91.2868.1599663989000000.gz .
gunzip /var/lib/systemd/coredump/core.scylla.996.cb537eccb6af44e4b12b4442c180ee91.2868.1599663989000000.gz

bug high repair

Nodes are available:

Name | Ip address | Current State | Cloud | Region
-- | -- | -- | -- | --
longevity-large-partitions-200k-pks-db-node-f93d5ecf-2 | 13.49.73.139 | running | aws | eu-north-1
longevity-large-partitions-200k-pks-db-node-f93d5ecf-3 | 13.49.78.64 | running | aws | eu-north-1
longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 | 13.49.80.133 | running | aws | eu-north-1
longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 | 13.48.105.201 | running | aws | eu-north-1
longevity-large-partitions-200k-pks-db-node-f93d5ecf-1 | 13.53.126.243 | running | aws | eu-north-1
SCT-Runner | 13.53.38.50 | running | aws | eu-north-1
longevity-large-partitions-200k-pks-monitor-node-f93d5ecf-1 | 13.48.48.151 | running | aws | eu-north-1

@bhalevy can someone from your team take a look at this?
We still have the cluster if it helps.

@denesb please look into this.
Can it stem from the combining reader?

Cc @asias repair is involved

core.scylla.996.0d6df37b9fca4e6ca646ad6fae051760.3715.1599663993000000

We crash because the current fragment (range tombstone) is smaller than the previous fragment (clustering row):

(gdb) p _validator._prev_pos._ck
$36 = {
  <std::_Optional_base<clustering_key_prefix, false, false>> = {
    <std::_Optional_base_impl<clustering_key_prefix, std::_Optional_base<clustering_key_prefix, false, false> >> = {<No data fields>}, 
    members of std::_Optional_base<clustering_key_prefix, false, false>:
    _M_payload = {
      <std::_Optional_payload<clustering_key_prefix, true, false, false>> = {
        <std::_Optional_payload_base<clustering_key_prefix>> = {
          _M_payload = {
            _M_empty = {<No data fields>},
            _M_value = {
              <prefix_compound_wrapper<clustering_key_prefix, clustering_key_prefix_view, clustering_key_prefix>> = {
                <compound_wrapper<clustering_key_prefix, clustering_key_prefix_view>> = {
                  _bytes = 0008000000000000ec29
                }, <No data fields>}, <No data fields>}
          },
          _M_engaged = true
        }, <No data fields>}, <No data fields>}
  }, 
  <std::_Enable_copy_move<true, true, true, true, std::optional<clustering_key_prefix> >> = {<No data fields>}, <No data fields>}

(gdb) p pos
$37 = {
  _type = partition_region::clustered,
  _bound_weight = bound_weight::before_all_prefixed,
  _ck = 0x60a031d54650
}
(gdb) p *pos._ck
$38 = {
  <prefix_compound_wrapper<clustering_key_prefix, clustering_key_prefix_view, clustering_key_prefix>> = {
    <compound_wrapper<clustering_key_prefix, clustering_key_prefix_view>> = {
      _bytes = 00080000000000010469
    }, <No data fields>}, <No data fields>}

$ build/dev/tools/scylla-types --compare --prefix-compound --type='org.apache.cassandra.db.marshal.ReversedType(org.apache.cassandra.db.marshal.LongType)' 0008000000000000ec33 00080000000000010469
(60467) > (66665)

The range tombstone itself looks ok:

(gdb) p $13.start._bytes
$27 = 00080000000000010469
(gdb) p $13.end._bytes  
$28 = 00080000000000008235

$ build/dev/tools/scylla-types --compare --prefix-compound --type='org.apache.cassandra.db.marshal.ReversedType(org.apache.cassandra.db.marshal.LongType)' 00080000000000010469 00080000000000008235
(66665) < (33333)
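
For readers without a Scylla build at hand, the two `scylla-types` comparisons above can be approximated by hand. The sketch below is not the tool's implementation. It assumes, from the values printed above, that each serialized key is a 2-byte big-endian length followed by a single 8-byte big-endian signed long, and that `ReversedType` simply inverts the natural integer order. The function names are hypothetical.

```python
import struct


def decode_single_long_key(hex_bytes: str) -> int:
    """Decode a one-component key: 2-byte length prefix + 8-byte signed long.

    The layout is an assumption inferred from the gdb output in this thread.
    """
    raw = bytes.fromhex(hex_bytes)
    (length,) = struct.unpack_from(">H", raw, 0)
    if length != 8 or len(raw) != 2 + length:
        raise ValueError(f"unexpected key layout: {hex_bytes}")
    (value,) = struct.unpack_from(">q", raw, 2)
    return value


def compare_reversed(a_hex: str, b_hex: str) -> int:
    """Three-way compare under reversed order: larger values sort first.

    Returns -1 if a sorts before b, 0 if equal, 1 if a sorts after b.
    """
    a = decode_single_long_key(a_hex)
    b = decode_single_long_key(b_hex)
    return (b > a) - (b < a)


# Previous row vs. start of the range tombstone.
print(compare_reversed("0008000000000000ec33", "00080000000000010469"))
# Start vs. end of the range tombstone.
print(compare_reversed("00080000000000010469", "00080000000000008235"))
```

The first call returns 1, which corresponds to `(60467) > (66665)`: in the reversed clustering order the tombstone's start sorts before the row the validator has already seen. The second call returns -1, which corresponds to `(66665) < (33333)`, so the tombstone's own bounds are well ordered.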

The remaining content of the reader's buffer:

(gdb) p this->_buffer
$40 = {
  _impl = {
    <std::allocator<mutation_fragment>> = {
      <__gnu_cxx::new_allocator<mutation_fragment>> = {<No data fields>}, <No data fields>}, 
    members of seastar::circular_buffer<mutation_fragment, std::allocator<mutation_fragment> >::impl:
    storage = 0x60a01367d1c0,
    begin = 6211,
    end = 6213,
    capacity = 4
  }
}
(gdb) p _buffer._impl.begin % 4
$41 = 3
(gdb) p _buffer._impl.end % 4
$42 = 1
(gdb) p $dereference_smart_ptr(_buffer._impl.storage[3]._data)._clustering_row._ck
$47 = {
  <prefix_compound_wrapper<clustering_key_prefix, clustering_key_prefix_view, clustering_key_prefix>> = {
    <compound_wrapper<clustering_key_prefix, clustering_key_prefix_view>> = {
      _bytes = 0008000000000000ec28
    }, <No data fields>}, <No data fields>}
(gdb) p $dereference_smart_ptr(_buffer._impl.storage[0]._data)._clustering_row._ck
$48 = {
  <prefix_compound_wrapper<clustering_key_prefix, clustering_key_prefix_view, clustering_key_prefix>> = {
    <compound_wrapper<clustering_key_prefix, clustering_key_prefix_view>> = {
      _bytes = 0008000000000000ec27
    }, <No data fields>}, <No data fields>}
(gdb) p $dereference_smart_ptr(_buffer._impl.storage[1]._data)._clustering_row._ck
$49 = {
  <prefix_compound_wrapper<clustering_key_prefix, clustering_key_prefix_view, clustering_key_prefix>> = {
    <compound_wrapper<clustering_key_prefix, clustering_key_prefix_view>> = {
      _bytes = 0008000000000000ec29
    }, <No data fields>}, <No data fields>}

Note that the order of the fragments in the buffer is not consistent:

$ build/dev/tools/scylla-types --compare --prefix-compound --type='org.apache.cassandra.db.marshal.ReversedType(org.apache.cassandra.db.marshal.LongType)' 0008000000000000ec28 0008000000000000ec27
(60456) < (60455)
$ build/dev/tools/scylla-types --compare --prefix-compound --type='org.apache.cassandra.db.marshal.ReversedType(org.apache.cassandra.db.marshal.LongType)' 0008000000000000ec27 0008000000000000ec29
(60455) > (60457)

Furthermore the key 0008000000000000ec29 is the same as that of the validator's previous row.

Never mind: the buffer has only two elements. The last one I printed had already been popped and is probably the previous row seen by the validator.
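
The slot arithmetic behind that correction can be sketched as follows. This assumes, based on the gdb output above, that `begin` and `end` are monotonically increasing counters and that the element with logical index `i` lives in slot `i % capacity`. `live_slots` is a hypothetical helper, not part of Seastar.

```python
from typing import List


def live_slots(begin: int, end: int, capacity: int) -> List[int]:
    """Return the storage slots holding live elements, in queue order."""
    if not 0 <= end - begin <= capacity:
        raise ValueError("inconsistent begin/end/capacity")
    return [i % capacity for i in range(begin, end)]


# Values taken from the gdb session above.
print(live_slots(6211, 6213, 4))
```

For `begin = 6211`, `end = 6213` and `capacity = 4` the live slots are 3 and 0, so `storage[1]` (key `0008000000000000ec29`) lies outside the live range and is a stale, already-popped element.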

core.scylla.996.726b0a32f57d44b69419d9271fb74b06.4183.1599663989000000

The same situation:

  • previous row's key (last fragment seen by validator): 0008000000000000ec33 (clustering row)
  • current pos: 00080000000000010469 (range tombstone)
  • _buffer[0]: 0008000000000000ec32
  • _buffer[1]: 0008000000000000ec31

We have rows in the correct order and a wild range tombstone in the middle. The range tombstone is the very same as in the previous core.

I see the same range tombstone in the two other cores.
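
The invariant that the validator enforces can be illustrated with the values from this core. The sketch below is a simplification, not the `mutation_fragment_stream_validating_filter` code: it models each position as a single integer key, ignores bound weights, and assumes the reversed clustering order seen above (larger keys sort first). `first_violation` is a hypothetical name.

```python
from typing import Iterable, Optional, Tuple

Fragment = Tuple[str, int]  # (fragment kind, clustering key value)


def first_violation(fragments: Iterable[Fragment]) -> Optional[int]:
    """Return the index of the first fragment positioned before its predecessor."""
    prev = None
    for index, (_kind, key) in enumerate(fragments):
        position = -key  # reversed order: a larger key means an earlier position
        if prev is not None and position < prev:
            return index
        prev = position
    return None


# Fragment order as reconstructed from the second core.
stream = [
    ("clustering_row", 0xEC33),
    ("range_tombstone", 0x10469),
    ("clustering_row", 0xEC32),
    ("clustering_row", 0xEC31),
]
print(first_violation(stream))
```

The rows on their own are monotonic. Only the range tombstone at index 1 jumps backwards, which is the fragment the validator aborts on.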

@aleksbykov do we still have the cluster?

Nodes are available:

Name | Ip address | Current State | Cloud | Region
-- | -- | -- | -- | --
longevity-large-partitions-200k-pks-db-node-f93d5ecf-2 | 13.49.73.139 | running | aws | eu-north-1
longevity-large-partitions-200k-pks-db-node-f93d5ecf-3 | 13.49.78.64 | running | aws | eu-north-1
longevity-large-partitions-200k-pks-db-node-f93d5ecf-4 | 13.49.80.133 | running | aws | eu-north-1
longevity-large-partitions-200k-pks-db-node-f93d5ecf-5 | 13.48.105.201 | running | aws | eu-north-1
longevity-large-partitions-200k-pks-db-node-f93d5ecf-1 | 13.53.126.243 | running | aws | eu-north-1
SCT-Runner | 13.53.38.50 | running | aws | eu-north-1
longevity-large-partitions-200k-pks-monitor-node-f93d5ecf-1 | 13.48.48.151 | running | aws | eu-north-1

@denesb, yes, it is available.

@aleksbykov can you share the logs of node1 to node5?

For the record: node 5 was the repair master. Nodes 1 to 4 were followers and crashed.

@aleksbykov can you please upload sstables for node5 for the repaired table somewhere? Also please share full schema.

So, after discussing with @asias, it seems that the bad data should be on the repair master. The repair master will not send followers any data they already have. So if all 4 followers are writing this range tombstone, it means only the master has it.
There are 3 possible sources for the out-of-order range tombstone:

  • sstable
  • memtable
  • merging the two

We can probably exclude the first, as the cluster is running with key validation and we would have seen an earlier crash if the node had attempted to write an out-of-order fragment to the sstable. There is still a possibility of a bug in the sstable reader code, though.

@aleksbykov are there any commitlog files on node5?

@aleksbykov ping.

@asias @denesb

All db logs available by link: https://cloudius-jenkins-test.s3.amazonaws.com/f93d5ecf-da7f-4ec9-a9b0-97e33790be20/20200909_155904/db-cluster-f93d5ecf.zip

Unfortunately, the cluster was terminated by mistake, and only the monitoring stack could be restored, so I can't say whether there were any commitlog files on node5. If the Scylla monitoring stack has any metrics for that, I can restore it along with all its data.

I will need to revisit https://github.com/scylladb/scylla/issues/5615 in order to catch this bug -- if we manage to reproduce.

@aleksbykov if we still have the metrics, can you please check the scylla_database_paused_reads_permit_based_evictions metric?

@denesb
Here is a screenshot:
Screenshot from 2020-09-16 15-34-05

Do you need any special query?

@aleksbykov can you filter for the repair master node and for label class=streaming?

Screenshot from 2020-09-16 16-03-20

Thanks @aleksbykov, this is enough to confirm that the theory raised by @tgrabiec is plausible: this is likely yet another evictable reader reconstruction bug.

I will write a patch which will validate the start of the buffer after reader recreation to catch these at the source in the future.
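
The idea of that check can be sketched as follows. This is only an illustration of validating the first fragment after recreation, not the actual patch: positions are modeled as plain integers already in stream order (larger means later), and `validate_after_recreation`, `ValidationError` and the parameters are hypothetical.

```python
from typing import Optional, Sequence


class ValidationError(Exception):
    """Raised when a recreated reader starts before the last emitted position."""


def validate_after_recreation(last_emitted: Optional[int],
                              new_buffer: Sequence[int]) -> None:
    """Check that the recreated reader does not go back in the stream."""
    if last_emitted is None or not new_buffer:
        return  # nothing emitted yet, or nothing to check
    first = new_buffer[0]
    if first < last_emitted:
        raise ValidationError(
            f"first fragment at {first} is before last emitted position {last_emitted}"
        )


# A recreated reader that resumes at or after the last position passes.
validate_after_recreation(10, [10, 11, 12])

# One that starts with an out-of-range fragment is rejected at the source.
try:
    validate_after_recreation(10, [5, 11, 12])
except ValidationError as err:
    print("rejected:", err)
```

Checking at this point reports the problem on the node that produced the bad fragment, instead of on the followers that later try to write it to an sstable.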

I've written the validator for the evictable reader but it is already finding bugs in the code, just by running the tests.

Turns out these were bugs in the validation itself. Patch is on the list: [PATCH v1 0/3] evictable_reader: validate buffer on reader recreation

An RPM with the validator backported to the 4.2 branch can be found at http://scratch.scylladb.com/bdenes/7208/scylla-server-4.2.rc4-0.20200918.9fd1fe19d.x86_64.rpm. Download it with:

gsutil cp gs://scratch.scylladb.com/bdenes/7208/scylla-server-4.2.rc4-0.20200918.9fd1fe19d.x86_64.rpm .

It's not a fix; rather, it's an additional validation that should help us catch the actual bug.

@denesb
I got a coredump with your patch using the same steps. This time, though, I got it only on the node where the RepairStreamingErr nemesis was running.

2020-09-21 06:04:41.000: (CoreDumpEvent Severity.ERROR): node=Node repo-7208-large-partitions-200k-pks-db-node-d6b51a4d-2 [13.53.207.107 | 10.0.1.62] (seed: False)
corefile_url=
https://storage.cloud.google.com/upload.scylladb.com/core.scylla.996.a04b42f561c648f4abf9723ae5f5716b.18873.1600668281000000/core.scylla.996.a04b42f561c648f4abf9723ae5f5716b.18873.1600668281000000.gz
backtrace=           PID: 18873 (scylla)
UID: 996 (scylla)
GID: 1001 (scylla)
Signal: 6 (ABRT)
Timestamp: Mon 2020-09-21 06:04:41 UTC (2min 27s ago)
Command Line: /usr/bin/scylla --blocked-reactor-notify-ms 500 --abort-on-lsa-bad-alloc 1 --abort-on-seastar-bad-alloc --abort-on-internal-error 1 --abort-on-ebadf 1 --enable-sstable-key-validation 1 --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --io-properties-file=/etc/scylla.d/io_properties.yaml --cpuset 0-11 --lock-memory=1
Executable: /opt/scylladb/libexec/scylla
Control Group: /
Boot ID: a04b42f561c648f4abf9723ae5f5716b
Machine ID: df877a200226bc47d06f26dae0736ec9
Hostname: repo-7208-large-partitions-200k-pks-db-node-d6b51a4d-2
Coredump: /var/lib/systemd/coredump/core.scylla.996.a04b42f561c648f4abf9723ae5f5716b.18873.1600668281000000
Message: Process 18873 (scylla) of user 996 dumped core.
Stack trace of thread 18891:
#0  0x00007f14edd999e5 raise (libc.so.6)
#1  0x00007f14edd8294d abort (libc.so.6)
#2  0x0000000002e4c2a3 _ZN7seastar17on_internal_errorERNS_6loggerESt17basic_string_viewIcSt11char_traitsIcEE (scylla)
#3  0x00000000011710e7 _ZN16evictable_reader36maybe_validate_position_in_partitionERKN7seastar15circular_bufferI17mutation_fragmentSaIS2_EEE (scylla)
#4  0x000000000117275d _ZZN16evictable_reader11fill_bufferER20flat_mutation_readerNSt6chrono10time_pointIN7seastar12lowres_clockENS2_8durationIlSt5ratioILl1ELl1000EEEEEEENKUlvE_clEv (scylla)
#5  0x0000000001173d0c _ZN7seastar12continuationINS_8internal22promise_base_with_typeIJEEEZN16evictable_reader11fill_bufferER20flat_mutation_readerNSt6chrono10time_pointINS_12lowres_clockENS7_8durationIlSt5ratioILl1ELl1000EEEEEEEUlvE_ZZNS_6futureIJEE14then_impl_nrvoISF_SH_EET0_OT_ENKUlvE_clEvEUlRS3_RSF_ONS_12future_stateIJEEEE_JEE15run_and_disposeEv (scylla)
#6  0x0000000002e81458 _ZN7seastar7reactor9run_tasksERNS0_10task_queueE (scylla)
#7  0x0000000002e817cf _ZN7seastar7reactor14run_some_tasksEv.part.0 (scylla)
#8  0x0000000002eb88ae _ZN7seastar7reactor3runEv (scylla)
#9  0x0000000002ec82ab _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS_14reactor_configEENKUlvE1_clEv (scylla)
#10 0x0000000002e4c52e _ZN7seastar12posix_thread13start_routineEPv (scylla)
#11 0x00007f14ee7db432 start_thread (libpthread.so.0)
#12 0x00007f14ede5e913 __clone (libc.so.6)
Stack trace of thread 18898:
#0  0x00007f14ee7e59ac read (libpthread.so.0)
#1  0x0000000003110147 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x00000000031103a8 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC4EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEEUlvE_E9_M_invokeERKSt9_Any_data (scylla)
#3  0x0000000002e4c52e _ZN7seastar12posix_thread13start_routineEPv (scylla)
#4  0x00007f14ee7db432 start_thread (libpthread.so.0)
#5  0x00007f14ede5e913 __clone (libc.so.6)
Stack trace of thread 18905:
#0  0x00007f14ee7e59ac read (libpthread.so.0)
#1  0x0000000003110147 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x00000000031103a8 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC4EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEEUlvE_E9_M_invokeERKSt9_Any_data (scylla)
#3  0x0000000002e4c52e _ZN7seastar12posix_thread13start_routineEPv (scylla)
#4  0x00007f14ee7db432 start_thread (libpthread.so.0)
#5  0x00007f14ede5e913 __clone (libc.so.6)
Stack trace of thread 18895:
#0  0x00007f14ee7e59ac read (libpthread.so.0)
#1  0x0000000003110147 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x00000000031103a8 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC4EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEEUlvE_E9_M_invokeERKSt9_Any_data (scylla)
#3  0x0000000002e4c52e _ZN7seastar12posix_thread13start_routineEPv (scylla)
#4  0x00007f14ee7db432 start_thread (libpthread.so.0)
#5  0x00007f14ede5e913 __clone (libc.so.6)
Stack trace of thread 18896:
#0  0x00007f14ee7e59ac read (libpthread.so.0)
#1  0x0000000003110147 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x00000000031103a8 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC4EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEEUlvE_E9_M_invokeERKSt9_Any_data (scylla)
#3  0x0000000002e4c52e _ZN7seastar12posix_thread13start_routineEPv (scylla)
#4  0x00007f14ee7db432 start_thread (libpthread.so.0)
#5  0x00007f14ede5e913 __clone (libc.so.6)
Stack trace of thread 18899:
#0  0x00007f14ee7e59ac read (libpthread.so.0)
#1  0x0000000003110147 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x00000000031103a8 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC4EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEEUlvE_E9_M_invokeERKSt9_Any_data (scylla)
#3  0x0000000002e4c52e _ZN7seastar12posix_thread13start_routineEPv (scylla)
#4  0x00007f14ee7db432 start_thread (libpthread.so.0)
#5  0x00007f14ede5e913 __clone (libc.so.6)
Stack trace of thread 18904:
#
download_instructions=
gsutil cp gs://upload.scylladb.com/core.scylla.996.a04b42f561c648f4abf9723ae5f5716b.18873.1600668281000000/core.scylla.996.a04b42f561c648f4abf9723ae5f5716b.18873.1600668281000000.gz .
gunzip /var/lib/systemd/coredump/core.scylla.996.a04b42f561c648f4abf9723ae5f5716b.18873.1600668281000000.gz

@aleksbykov thanks. Yes, this is expected: the "bad" data should come from the repair master, so this is the only node that should trigger the validation error.

Downloading the core.

@denesb do you need the nodes alive, or can I terminate them?

Also, one more core about 10 minutes later:

2020-09-21 06:17:09.000: (CoreDumpEvent Severity.ERROR): node=Node repo-7208-large-partitions-200k-pks-db-node-d6b51a4d-2 [13.53.207.107 | 10.0.1.62] (seed: False)
corefile_url=
https://storage.cloud.google.com/upload.scylladb.com/core.scylla.996.82ae78cf49594ff7a1a6dba77a243bbc.1300.1600669029000000/core.scylla.996.82ae78cf49594ff7a1a6dba77a243bbc.1300.1600669029000000.gz
backtrace=           PID: 1300 (scylla)
UID: 996 (scylla)
GID: 1001 (scylla)
Signal: 6 (ABRT)
Timestamp: Mon 2020-09-21 06:17:09 UTC (2min 26s ago)
Command Line: /usr/bin/scylla --blocked-reactor-notify-ms 500 --abort-on-lsa-bad-alloc 1 --abort-on-seastar-bad-alloc --abort-on-internal-error 1 --abort-on-ebadf 1 --enable-sstable-key-validation 1 --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --io-properties-file=/etc/scylla.d/io_properties.yaml --cpuset 0-11 --lock-memory=1
Executable: /opt/scylladb/libexec/scylla
Control Group: /
Boot ID: 82ae78cf49594ff7a1a6dba77a243bbc
Machine ID: df877a200226bc47d06f26dae0736ec9
Hostname: repo-7208-large-partitions-200k-pks-db-node-d6b51a4d-2
Coredump: /var/lib/systemd/coredump/core.scylla.996.82ae78cf49594ff7a1a6dba77a243bbc.1300.1600669029000000
Message: Process 1300 (scylla) of user 996 dumped core.
Stack trace of thread 1321:
#0  0x00007f19515059e5 raise (libc.so.6)
#1  0x00007f19514ee94d abort (libc.so.6)
#2  0x0000000002e4c2a3 _ZN7seastar17on_internal_errorERNS_6loggerESt17basic_string_viewIcSt11char_traitsIcEE (scylla)
#3  0x00000000011710e7 _ZN16evictable_reader36maybe_validate_position_in_partitionERKN7seastar15circular_bufferI17mutation_fragmentSaIS2_EEE (scylla)
#4  0x000000000117275d _ZZN16evictable_reader11fill_bufferER20flat_mutation_readerNSt6chrono10time_pointIN7seastar12lowres_clockENS2_8durationIlSt5ratioILl1ELl1000EEEEEEENKUlvE_clEv (scylla)
#5  0x0000000001173d0c _ZN7seastar12continuationINS_8internal22promise_base_with_typeIJEEEZN16evictable_reader11fill_bufferER20flat_mutation_readerNSt6chrono10time_pointINS_12lowres_clockENS7_8durationIlSt5ratioILl1ELl1000EEEEEEEUlvE_ZZNS_6futureIJEE14then_impl_nrvoISF_SH_EET0_OT_ENKUlvE_clEvEUlRS3_RSF_ONS_12future_stateIJEEEE_JEE15run_and_disposeEv (scylla)
#6  0x0000000002e81458 _ZN7seastar7reactor9run_tasksERNS0_10task_queueE (scylla)
#7  0x0000000002e817cf _ZN7seastar7reactor14run_some_tasksEv.part.0 (scylla)
#8  0x0000000002eb88ae _ZN7seastar7reactor3runEv (scylla)
#9  0x0000000002ec82ab _ZZN7seastar3smp9configureEN5boost15program_options13variables_mapENS_14reactor_configEENKUlvE1_clEv (scylla)
#10 0x0000000002e4c52e _ZN7seastar12posix_thread13start_routineEPv (scylla)
#11 0x00007f1951f47432 start_thread (libpthread.so.0)
#12 0x00007f19515ca913 __clone (libc.so.6)
Stack trace of thread 1333:
#0  0x00007f1951f519ac read (libpthread.so.0)
#1  0x0000000003110147 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x00000000031103a8 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC4EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEEUlvE_E9_M_invokeERKSt9_Any_data (scylla)
#3  0x0000000002e4c52e _ZN7seastar12posix_thread13start_routineEPv (scylla)
#4  0x00007f1951f47432 start_thread (libpthread.so.0)
#5  0x00007f19515ca913 __clone (libc.so.6)
Stack trace of thread 1338:
#0  0x00007f1951f519ac read (libpthread.so.0)
#1  0x0000000003110147 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x00000000031103a8 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC4EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEEUlvE_E9_M_invokeERKSt9_Any_data (scylla)
#3  0x0000000002e4c52e _ZN7seastar12posix_thread13start_routineEPv (scylla)
#4  0x00007f1951f47432 start_thread (libpthread.so.0)
#5  0x00007f19515ca913 __clone (libc.so.6)
Stack trace of thread 1337:
#0  0x00007f1951f519ac read (libpthread.so.0)
#1  0x0000000003110147 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x00000000031103a8 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC4EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEEUlvE_E9_M_invokeERKSt9_Any_data (scylla)
#3  0x0000000002e4c52e _ZN7seastar12posix_thread13start_routineEPv (scylla)
#4  0x00007f1951f47432 start_thread (libpthread.so.0)
#5  0x00007f19515ca913 __clone (libc.so.6)
Stack trace of thread 1334:
#0  0x00007f1951f519ac read (libpthread.so.0)
#1  0x0000000003110147 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x00000000031103a8 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC4EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEEUlvE_E9_M_invokeERKSt9_Any_data (scylla)
#3  0x0000000002e4c52e _ZN7seastar12posix_thread13start_routineEPv (scylla)
#4  0x00007f1951f47432 start_thread (libpthread.so.0)
#5  0x00007f19515ca913 __clone (libc.so.6)
Stack trace of thread 1336:
#0  0x00007f1951f519ac read (libpthread.so.0)
#1  0x0000000003110147 _ZN7seastar11thread_pool4workENS_13basic_sstringIcjLj15ELb1EEE (scylla)
#2  0x00000000031103a8 _ZNSt17_Function_handlerIFvvEZN7seastar11thread_poolC4EPNS1_7reactorENS1_13basic_sstringIcjLj15ELb1EEEEUlvE_E9_M_invokeERKSt9_Any_data (scylla)
#3  0x0000000002e4c52e _ZN7seastar12posix_thread13start_routineEPv (scylla)
#4  0x00007f1951f47432 start_thread (libpthread.so.0)
#5  0x00007f19515ca913 __clone (libc.so.6)
Stack trace of thread 1335:
#0  0x000
download_instructions=
gsutil cp gs://upload.scylladb.com/core.scylla.996.82ae78cf49594ff7a1a6dba77a243bbc.1300.1600669029000000/core.scylla.996.82ae78cf49594ff7a1a6dba77a243bbc.1300.1600669029000000.gz .
gunzip /var/lib/systemd/coredump/core.scylla.996.82ae78cf49594ff7a1a6dba77a243bbc.1300.1600669029000000.gz

> @denesb do you need the nodes alive, or can I terminate them?

Can you keep only the crashed node alive? The instance doesn't need to be up; I just need the contents of its disk.

OK. This node can be reached at 13.53.207.107; it is in the eu-north-1 region.

The first fragment in the buffer is a range tombstone: {start=, end=0008000000000000c34f}. This is emitted, despite the data being requested from position 000800000000000167be on.

$ build/dev/tools/scylla-types --compare -t 'org.apache.cassandra.db.marshal.ReversedType(org.apache.cassandra.db.marshal.LongType)' --prefix-compound --value 000800000000000167be 0008000000000000c34f
(92094) < (49999)
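
The comparison above can be reproduced by hand. The sketch below is in Python rather than Scylla's C++ and assumes the serialized prefix is a 2-byte big-endian length followed by an 8-byte big-endian signed long; this layout is inferred from the hex values shown, not taken from Scylla's source, and the helper names are hypothetical.

```python
import struct


def decode_long_component(hex_key: str) -> int:
    """Decode a single-component prefix: u16 length, then a big-endian int64.

    The layout is an assumption inferred from the hex values in this thread.
    """
    raw = bytes.fromhex(hex_key)
    (length,) = struct.unpack_from(">H", raw, 0)
    if length != 8 or len(raw) != 2 + length:
        raise ValueError(f"unexpected component length in {hex_key!r}")
    (value,) = struct.unpack_from(">q", raw, 2)
    return value


def compare_reversed(a: int, b: int) -> int:
    """Three-way compare under a reversed type: larger values sort first."""
    return (b > a) - (b < a)


read_start = decode_long_component("000800000000000167be")
tombstone_end = decode_long_component("0008000000000000c34f")

# Prints: 92094 49999 -1
print(read_start, tombstone_end, compare_reversed(read_start, tombstone_end))
```

A result of -1 means the read start (92094) orders before the tombstone's end (49999) in clustering order, so the tombstone still overlaps the requested range.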

The range tombstone is relevant to the range, so its emission is not a bug in and of itself. This might be another case of range tombstones not being trimmed to the read range, similar to https://github.com/scylladb/scylla/issues/4104.

@denesb I don't think readers are currently required to trim range tombstones to the query range (sstable readers don't), only to provide monotonic positions.

This could break evictable readers, which, after being unpaused, may receive a tombstone that starts before fragments that were already emitted.

@tgrabiec do you think it would make sense to fix this inside the evictable reader? Should we make it trim range tombstones at the start of the buffer?
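
A toy model of how this breaks monotonicity, assuming positions are plain integers in ascending order and a range tombstone is positioned at its start bound. `Fragment` and `first_monotonicity_violation` are hypothetical names for illustration, not the actual `evictable_reader` code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Fragment:
    kind: str                   # "row" or "range_tombstone"
    position: int               # row key, or the tombstone's start bound
    end: Optional[int] = None   # tombstone end bound; None for rows


def first_monotonicity_violation(fragments: Iterable[Fragment]) -> Optional[Fragment]:
    """Return the first fragment whose position goes backwards, or None."""
    last = None
    for fragment in fragments:
        if last is not None and fragment.position < last:
            return fragment
        last = fragment.position
    return None


# Fragments emitted before the reader was paused and evicted.
before_eviction = [Fragment("row", 10), Fragment("row", 20)]

# The recreated reader resumes after position 20, but the underlying reader
# emits the overlapping tombstone [5, 30] untrimmed, starting at position 5.
after_recreation = [Fragment("range_tombstone", 5, end=30), Fragment("row", 25)]

violation = first_monotonicity_violation(before_eviction + after_recreation)
# Prints: Fragment(kind='range_tombstone', position=5, end=30)
print(violation)
```

Each half of the stream is monotonic on its own; the violation only appears when the two are concatenated, which is what a consumer of the evictable reader sees.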

I think requiring readers to not emit out-of-range fragments is a reasonable expectation, but it is not something we require right now. Imposing this restriction would require sweeping changes across all our readers.

> @tgrabiec do you think it would make sense to fix this inside the evictable reader? Should we make it trim range tombstones at the start of the buffer?

Yes. But there may be an arbitrary number of range tombstones, more than a single buffer, which precede the first row of the range.

> I think requiring readers to not emit out-of-range fragments is a reasonable expectation, but it is not something we require right now. Imposing this restriction would require sweeping changes across all our readers.

Yes, probably only sstable readers are not doing this.

> > @tgrabiec do you think it would make sense to fix this inside the evictable reader? Should we make it trim range tombstones at the start of the buffer?
>
> Yes. But there may be an arbitrary number of range tombstones, more than a single buffer, which precede the first row of the range.

The evictable reader is already prepared for this; it will not stop reading until a position strictly larger than the last seen one is encountered, to ensure forward progress.

> > I think requiring readers to not emit out-of-range fragments is a reasonable expectation, but it is not something we require right now. Imposing this restriction would require sweeping changes across all our readers.
>
> Yes, probably only sstable readers are not doing this.

Yes, we fixed this for partition_snapshot_reader, which is used by both memtable and cache.

So I propose that, as a short-term fix, we fix this in the evictable reader, and I will open an issue to do the proper fix across all readers.

@tgrabiec do you concur?
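
A minimal sketch of the proposed short-term fix, under the same toy assumptions (integer positions in ascending order, bound inclusiveness glossed over). It is not the actual patch: `trim_to_resume_position` and `resume_after` are hypothetical names, and clamping the start bound to the resume position is an assumed rule for illustration.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Fragment:
    kind: str                   # "row" or "range_tombstone"
    position: int               # row key, or the tombstone's start bound
    end: Optional[int] = None   # tombstone end bound; None for rows


def trim_to_resume_position(
    fragments: Iterable[Fragment], resume_after: int
) -> Iterator[Fragment]:
    """Drop or clamp range tombstones that start before the resume position."""
    for fragment in fragments:
        if fragment.kind == "range_tombstone":
            if fragment.end <= resume_after:
                # Ends before the resume point: nothing left to emit.
                continue
            if fragment.position < resume_after:
                # Overlaps the resume point: keep only the part after it.
                fragment = replace(fragment, position=resume_after)
        yield fragment


stream = [
    Fragment("range_tombstone", 1, end=15),   # entirely before position 20
    Fragment("range_tombstone", 5, end=30),   # straddles position 20
    Fragment("row", 25),
]
trimmed = list(trim_to_resume_position(stream, resume_after=20))
print(trimmed)
```

Because the function is a generator over the whole stream, it does not matter how many leading tombstones there are or how many buffers they span, which is the concern raised above.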

On Mon, Sep 21, 2020 at 11:40 AM Botond Dénes notifications@github.com wrote:

> > > @tgrabiec https://github.com/tgrabiec do you think it would make sense to fix this inside the evictable reader? Should we make it trim range tombstones at the start of the buffer?
> >
> > Yes. But there may be an arbitrary number of range tombstones, more than a single buffer, which precede the first row of the range.
>
> The evictable reader is already prepared for this; it will not stop reading until a position strictly larger than the last seen one is encountered, to ensure forward progress.
>
> > > I think requiring readers to not emit out-of-range fragments is a reasonable expectation, but it is not something we require right now. Imposing this restriction would require sweeping changes across all our readers.
> >
> > Yes, probably only sstable readers are not doing this.
>
> Yes, we fixed this for partition_snapshot_reader, which is used by both memtable and cache.

cache is not using partition_snapshot_reader.

> So I propose that, as a short-term fix, we fix this in the evictable reader, and I will open an issue to do the proper fix across all readers.
>
> @tgrabiec https://github.com/tgrabiec do you concur?

Yes.

Ok, I'm already working on this.

@aleksbykov please rename the issue to: "evictable_reader: out-of-range range tombstones emitted on reader recreation cause fragment stream monotonicity violations"

Patch on the list: [PATCH v2 0/5] evictable_reader: validate buffer on reader recreation

v2 also includes a fix for this issue, in addition to a validator that should catch such issues in the future.

> @aleksbykov please rename the issue to: "evictable_reader: out-of-range range tombstones emitted on reader recreation cause fragment stream monotonicity violations"

Done

@aleksbykov you can destroy any nodes left over from the cluster.

@tgrabiec please evaluate this for backport to 4.2

@denesb do we need this backported to 4.1 / 4.0?

Checking the weekend runs with QA.

@slivne yes, 4.0 and 4.1 are affected too.

Backported to 4.2 and 4.1

@tgrabiec why not 4.0?

Backported to 4.0 as well.
