Scylla: sstables::malformed_sstable_exception (non-null component after null component)

Created on 16 Nov 2020  ·  42Comments  ·  Source: scylladb/scylla

This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing-list at [email protected] or in our slack channel.

  • [x] I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.

Installation details
Scylla version (or git commit hash): 4.1.6-0.20200831.6d9ff622df
Cluster size: 7 (temporarily, 4 normally)
OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu 18.04.5 LTS (GNU/Linux 4.15.0-112-generic x86_64)


We're having the following error for different sstables on all nodes of our cluster since yesterday.

Nov 16 16:26:55 scylla-01 scylla:  [shard 12] compaction_manager - compaction failed: sstables::malformed_sstable_exception (non-null component after null component in sstable /var/lib/scylla/data/rick/sessions-c6a9b530abc111eaa092000000000001/mc-86626-big-Data.db): retrying

Maybe worth noting that we have it even on the node that's joining cluster right now.


DESCRIBE TABLE rick.sessions result

CREATE TABLE rick.sessions (
    app_id int,
    client_id text,
    type text,
    ga_sessioncount int,
    ga_date int,
    all_landing_page_path text,
    app_date text,
    attribution_id uuid,
    attribution_version int,
    campaign_type text,
    channel_group text,
    channel_type text,
    client_segment text,
    client_user_ids list<text>,
    cohorts map<text, int>,
    costs_multiplier double,
    ga_adcontent text,
    ga_adgroup text,
    ga_admatchedquery text,
    ga_adplacementdomain text,
    ga_adwordsadgroupid text,
    ga_adwordscampaignid text,
    ga_adwordscreativeid text,
    ga_adwordscustomerid text,
    ga_appversion text,
    ga_browser text,
    ga_browserversion text,
    ga_campaign text,
    ga_channelgrouping text,
    ga_city text,
    ga_country text,
    ga_datasource text,
    ga_datehourminute_max bigint,
    ga_datehourminute_min bigint,
    ga_devicecategory text,
    ga_eventactions set<text>,
    ga_hostname text,
    ga_keyword text,
    ga_landingpagepath text,
    ga_mobiledeviceinfo text,
    ga_networkdomain text,
    ga_operatingsystem text,
    ga_pagepaths set<text>,
    ga_region text,
    ga_screennames set<text>,
    ga_screenresolution text,
    ga_sourcemedium text,
    ga_transactionid_transactionrevenue map<text, decimal>,
    grabber_id uuid,
    initial_landing_page_path text,
    landing_page_path text,
    metric_hits map<text, int>,
    metric_hits_first_click map<text, int>,
    metric_hits_indirect_free map<text, int>,
    metric_hits_post_click map<text, int>,
    real_source_medium text,
    segment text static,
    transactionid_transactionrevenue_first_click map<text, decimal>,
    transactionid_transactionrevenue_indirect_free map<text, decimal>,
    transactionid_transactionrevenue_post_click map<text, decimal>,
    user_id text,
    user_ids set<text> static,
    white_list_landing_page_path text,
    PRIMARY KEY ((app_id, client_id), type, ga_sessioncount, ga_date)
) WITH CLUSTERING ORDER BY (type ASC, ga_sessioncount ASC, ga_date ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'LeveledCompactionStrategy'}
    AND compression = {'chunk_length_in_kb': '4', 'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

sstabledump for the mentioned sstable fails with the following error:

Exception in thread "main" org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /var/lib/scylla/data/rick/sessions-c6a9b530abc111eaa092000000000001/mc-86626-big-Data.db
    at org.apache.cassandra.io.sstable.SSTableIdentityIterator.hasNext(SSTableIdentityIterator.java:134)
    at org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.computeNext(LazilyInitializedUnfilteredRowIterator.java:100)
    at org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.computeNext(LazilyInitializedUnfilteredRowIterator.java:32)
    at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
    at org.apache.cassandra.tools.JsonTransformer.serializePartition(JsonTransformer.java:206)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
    at org.apache.cassandra.tools.JsonTransformer.toJson(JsonTransformer.java:101)
    at org.apache.cassandra.tools.SSTableExport.run(SSTableExport.java:253)
    at com.scylladb.tools.SSTableExport.main(SSTableExport.java:44)
Caused by: java.io.IOException: Corrupt (negative) value length encountered
    at org.apache.cassandra.db.marshal.AbstractType.readValue(AbstractType.java:439)
    at org.apache.cassandra.db.rows.Cell$Serializer.deserialize(Cell.java:255)
    at org.apache.cassandra.db.rows.UnfilteredSerializer.readSimpleColumn(UnfilteredSerializer.java:659)
    at org.apache.cassandra.db.rows.UnfilteredSerializer.lambda$deserializeRowBody$1(UnfilteredSerializer.java:624)
    at org.apache.cassandra.utils.btree.BTree.applyForwards(BTree.java:1221)
    at org.apache.cassandra.utils.btree.BTree.applyForwards(BTree.java:1225)
    at org.apache.cassandra.utils.btree.BTree.apply(BTree.java:1176)
    at org.apache.cassandra.db.Columns.apply(Columns.java:384)
    at org.apache.cassandra.db.rows.UnfilteredSerializer.deserializeRowBody(UnfilteredSerializer.java:620)
    at org.apache.cassandra.db.rows.UnfilteredSerializer.deserializeOne(UnfilteredSerializer.java:482)
    at org.apache.cassandra.db.rows.UnfilteredSerializer.deserialize(UnfilteredSerializer.java:438)
    at org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:95)
    at org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:73)
    at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
    at org.apache.cassandra.io.sstable.SSTableIdentityIterator.hasNext(SSTableIdentityIterator.java:122)
    ... 17 more

Health check data and file sample upload UUID: 442158e9-791c-4a08-a3ec-5247af3505ec

User Request bug

Most helpful comment

I have a POC that is already successfully parsing the provided sstable. I sent an RFC to the mailing list: https://groups.google.com/g/scylladb-dev/c/U6QjCItNLd8/m/VBqN8mQxAQAJ

All 42 comments

@Flamefork thanks

  • I expect we may need the full components for the sstable you shared not only the Data part - can you upload all of mc-86626-* parts

@bhalevy - the upload has the schema and a sample sstable Data part - lets see if that is enough to start working on this.

🙏

Uploaded mc-86626-big-others.tar.gz to the same UUID.

Looking at closed issues that had these errors (https://github.com/scylladb/scylla/issues/4009)

@Flamefork I checked the schema you have some text fields as part of the partition / cluster key - could it be that those are non UTF-8

Thanks for uploading the other parts - we should be able to extract the needed info now

I think it's possible, but I haven't seen it in our data yet. From the key ((app_id, client_id), type, ga_sessioncount, ga_date) text fields, type comes from our code, but client_id is a free-form id, that usually comes in one of two common formats, but in theory, it could be anything.

Just in case, is there anything else we could provide/do to help debugging it?

FWIW, sstabldump also agrees the sstable is corrupted:

Exception in thread "main" org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /data/7623/442158e9-791c-4a08-a3ec-5247af3505ec/mc-86626-big-Data.db
        at org.apache.cassandra.io.sstable.SSTableIdentityIterator.hasNext(SSTableIdentityIterator.java:134)
        at org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.computeNext(LazilyInitializedUnfilteredRowIterator.java:100)
        at org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.computeNext(LazilyInitializedUnfilteredRowIterator.java:32)
        at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
        at org.apache.cassandra.tools.JsonTransformer.serializePartition(JsonTransformer.java:206)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
        at org.apache.cassandra.tools.JsonTransformer.toJson(JsonTransformer.java:101)
        at org.apache.cassandra.tools.SSTableExport.run(SSTableExport.java:253)
        at com.scylladb.tools.SSTableExport.main(SSTableExport.java:44)
Caused by: java.io.IOException: Corrupt (negative) value length encountered
        at org.apache.cassandra.db.marshal.AbstractType.readValue(AbstractType.java:439)
        at org.apache.cassandra.db.rows.Cell$Serializer.deserialize(Cell.java:255)
        at org.apache.cassandra.db.rows.UnfilteredSerializer.readSimpleColumn(UnfilteredSerializer.java:659)
        at org.apache.cassandra.db.rows.UnfilteredSerializer.lambda$deserializeRowBody$1(UnfilteredSerializer.java:624)
        at org.apache.cassandra.utils.btree.BTree.applyForwards(BTree.java:1221)
        at org.apache.cassandra.utils.btree.BTree.applyForwards(BTree.java:1225)
        at org.apache.cassandra.utils.btree.BTree.apply(BTree.java:1176)
        at org.apache.cassandra.db.Columns.apply(Columns.java:384)
        at org.apache.cassandra.db.rows.UnfilteredSerializer.deserializeRowBody(UnfilteredSerializer.java:620)
        at org.apache.cassandra.db.rows.UnfilteredSerializer.deserializeOne(UnfilteredSerializer.java:482)
        at org.apache.cassandra.db.rows.UnfilteredSerializer.deserialize(UnfilteredSerializer.java:438)
        at org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:95)
        at org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:73)
        at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
        at org.apache.cassandra.io.sstable.SSTableIdentityIterator.hasNext(SSTableIdentityIterator.java:122)
        ... 17 more

Reproduced on master with the given sstable.

(gdb) scylla schema $dereference_smart_ptr(_sst)._schema
(schema*) 0x60000728e000 ks="rick" cf="sessions" id=cacf9d70-28ce-11eb-b5f2-000000000000 version=c7b6ee27-9820-3bd4-80f1-68908756466b

partition key: byte_order_equal=true byte_order_comparable=false is_reversed=false
    "org.apache.cassandra.db.marshal.Int32Type"
    "org.apache.cassandra.db.marshal.UTF8Type"

clustering key: byte_order_equal=true byte_order_comparable=false is_reversed=false
    "org.apache.cassandra.db.marshal.UTF8Type"
    "org.apache.cassandra.db.marshal.Int32Type"
    "org.apache.cassandra.db.marshal.Int32Type"

(gdb) p $dereference_smart_ptr(_column_translation._state).clustering_column_value_fix_lengths
$16 = std::vector of length 3, capacity 3 = {std::optional<unsigned int> [no contained value], std::optional<unsigned int> = {[contained value] = 4}, std::optional<unsigned int> = {[contained value] = 4}}

(gdb) p _ck_blocks_header_offset
$18 = 2 # we are reading the third component

(gdb) p _row_key
$19 = std::vector of length 0, capacity 3 # how can this be empty?

(gdb) p/x _ck_blocks_header
$25 = 0x1e # b00011110 -> (null, null, not-null)

```

@Flamefork do you use Thrift at all?

@denesb No, we only use native transport (95% scylladb/gocql, 5% python scylla-driver)

Write path seems unprotected against non-trailing null components, so if one somehow makes it into memory we'll happily write it to sstables. There is only a comment explaining the assumption:
https://github.com/scylladb/scylla/blob/f55b522c1bdc25dbae56c1e1ff6fe1c14c9ecc60/sstables/mx/writer.cc#L196-L200

Cc @tgrabiec

The sstable is corrupt. If I uncomment the exception to let parsing continue, I hit this:

Exception when communicating with 127.0.0.1, to read from rick.sessions: sstables::malformed_sstable_exception (Too big ttl: 45220846477190 in sstable /home/bdenes/Desktop/scylladb/data/rick/sessions-cacf9d7028ce11ebb5f2000000000000/mc-86626-big-Data.db

Data just prior to the bad key seems sane.

@Botond Dénes bdenes@scylladb.com please create an issue for the missing
item in the write path.

We have sstable scrub - can it skip the bad partition - so we can extract
everything but this partition ?

Do you have any lead as to what may have caused this ?

Can we skip the bad partition ?

On Thu, Nov 19, 2020 at 11:00 AM Botond Dénes notifications@github.com
wrote:

The sstable is corrupt. If I uncomment the exception to let parsing
continue, I hit this:

Exception when communicating with 127.0.0.1, to read from rick.sessions: sstables::malformed_sstable_exception (Too big ttl: 45220846477190 in sstable /home/bdenes/Desktop/scylladb/data/rick/sessions-cacf9d7028ce11ebb5f2000000000000/mc-86626-big-Data.db

Data just prior to the bad key seems sane.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/7623#issuecomment-730228710,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AA2OCCCZHMVDTASJMTEGNPLSQTNCBANCNFSM4TXMWEVQ
.

Sstable scrub is not currently capable of doing this (which reminds me: we should document what it can currently do). We can however try to teach it to do this. We can add an index-based skipping read to sstable scrub. This would work like a normal read until an exception is hit during parsing. When this happens, scrub would stop and try to jump over the corrupt bits using the index. It would try skipping to the closest position ahead we have an index entry for. When doing this it would log the clustering range it skipped over. It would repeat this until it reaches EOF. This method assumes the index is intact.

Do you have any lead as to what may have caused this ?

Not yet.

please create an issue for the missing item in the write path.

Done: https://github.com/scylladb/scylla/issues/7650

@tgrabiec do you think this index-based skipping is feasible?

@denesb to add on this: I've seen few instances of Too big ttl error in our logs too. Would it help if I find and upload such stable too?
UPD: uploaded to UUID c1cbb1aa-8b33-4f77-8285-a7aba9ec7ecd

If I further comment out the Too big ttl error, I hit a large allocation request of more than 4GB. As the entire sstable is just 15MB, this is very likely a genuine corruption.

Fully parsed the it looks like this is a null row:

     6 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - FLAGS_2 _flags=0x0
     7 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - FLAGS_2 !ext flags; CLUSTERING_ROW
     8 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CLUSTERING_ROW
     9 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK
    10 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK_HEADER
    11 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK2
    12 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK2 - null
    13 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK                                                                                                                                                                                             
    14 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK2 
    15 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK2 - null 
    16 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK 
    17 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK2 
    18 ERROR 2020-11-19 10:52:54,902 [shard 0] sstable - Error while parsing clustering-key: non-null component after null component
    19 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK2 - empty
    20 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK
    21 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - mp_row_consumer_m 0x6000000aadb0: consume_row_start(ckp{0000})
    22 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - mp_row_consumer_m 0x6000000aadb0: emit
    23 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - mp_row_consumer_m 0x6000000aadb0: consume_row_marker_and_tombstone({row_marker: }, {tombstone: none}, {tombstone: none}), key={position: clustered,ckp{0000},0}
    24 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - data_consume_row_context 0x600000897000: state=46, size=1098 _i=470

@Flamefork was there a point in time after which you started experiencing these errors? If yes, is there an event that you can relate to it (update to new version, or something else)?

@denesb hmm, just noticed. Not sure if relevant though.
First appearance in logs @ Nov 14 12:41:14 UTC.
This is less than an hour after a new node was up in a cluster, but error appeared on another, old node (and then started appearing more and more on all other nodes), probably that's why we haven't noticed this fact earlier.

With compact storage,
CREATE TABLE cf (pk int, ck1 int, ck2 int, ck3 int, rc int, primary key (pk, ck1, ck2, ck3)) WITH COMPACT STORAGE;

I am able to produce a compound key with trailing nulls:

 pk | ck1 | ck2  | ck3  | rc
----+-----+------+------+----
  1 |   1 | null | null |  1

With insert I was unable to create a compound key where a non-null component follows a null one.

However, with range tombstone, created as follow, I am able to create a sstable which cannot be parsed by sstabledump:
DELETE from ks.cf WHERE pk=1 and pk2=1 AND ck=1 AND ck2=null AND ck3 <= 1;

Follow the error when running dump on it:

$ ./tools/bin/sstabledump ~/scylla/data/data/ks/cf-4a7aca702a7a11eb87b0000000000000/md-2-big-Data.db 
Using /home/raphaelsc/scylla/conf/scylla.yaml as the config file
[
  {
    "partition" : {
      "key" : [ "1", "1" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 42,
        "clustering" : [ 1 ] } ] } ]Exception in thread "main" java.lang.NullPointerException
    at org.apache.cassandra.serializers.Int32Serializer.deserialize(Int32Serializer.java:31)
    at org.apache.cassandra.serializers.Int32Serializer.deserialize(Int32Serializer.java:25)
    at org.apache.cassandra.db.marshal.Int32Type.toJSONString(Int32Type.java:101)
    at org.apache.cassandra.tools.JsonTransformer.serializeClustering(JsonTransformer.java:350)
    at org.apache.cassandra.tools.JsonTransformer.serializeRow(JsonTransformer.java:244)
    at org.apache.cassandra.tools.JsonTransformer.serializePartition(JsonTransformer.java:211)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
    at org.apache.cassandra.tools.JsonTransformer.toJson(JsonTransformer.java:101)
    at org.apache.cassandra.tools.SSTableExport.run(SSTableExport.java:253)
    at com.scylladb.tools.SSTableExport.main(SSTableExport.java:44)

Here https://github.com/scylladb/scylla/issues/7650#issuecomment-730404847, @denesb shared the code in writer which is supposed to handle this, but perhaps it's not sufficient for range tombstones

Opened https://github.com/scylladb/scylla/issues/7658 for the sstable scrub feature.

An interesting development: if I remove this 49 bytes from the data:

001e000400001f910000143935353639313136312e31353739373837303339007fffffff80000000000000008001020003

The sstable parses without any problems and the data looks sane.
Maybe there is just one corrupt row that throws off the parsing.

Actually there are some rows with null clustering keys, so maybe this was such a row, and serializing it went sideways.

I tried some further mutations of the bad row to see if I can make it parse but failed.

I think that the key to what happened here are the null keys. We've seen these in the past, although those were null partition keys. I still don't know where they are coming from, but we should add a similar validation tot he write path to what we already have for partition keys.

I started looking into making sstablescrub be able to skip over corrupt parts in the file. In the case of the sstable this ability should be enough to be able to salvage the remains of the file.

An interesting development: if I remove this 49 bytes from the data:

001e000400001f910000143935353639313136312e31353739373837303339007fffffff80000000000000008001020003

The sstable parses without any problems and the data looks sane.
Maybe there is just one corrupt row that throws off the parsing.

With my example above, I was able to create a sstable, using range tombstone (with one of the cks set to null) on a table created with compact storage which has a partition written to it with trailing null cks) that cannot be parsed by sstable dump yet can be parsed by us (tested it with major). probably some variant of it is exactly the bug being faced here.

@Flamefork how much is this issue impeding your ability to use the cluster? We are asking because we want to see how to prioritize the nodetool scrub feature vs. continuing to dig to the bottom of what caused this and adding protections against it happening in the future.

@denesb We have one faulty node in the cluster that we cannot fix: replacing it fails cause streaming cannot receive all needed data cause of a few malformed_sstable_exceptions on sending nodes (tried 2 times), and we can't repair it due to a crashloop. Also, all nodes have 100% CPU load due to compaction fail loop with malformed_sstable_exception. RP=3 + QUORUM reads and writes allows our services to work for now, but 1. at a much higher risk of having another node down 2. with much higher latencies. Which actually shows how resilient ScyllaDB under such conditions :) but right now I'm worried that any other minor incident in the cluster would add up to become a real disaster.
Not sure what's the best strategy here, tried to sum up our situation to provide full data for consideration.

An interesting development: if I remove this 49 bytes from the data:

001e000400001f910000143935353639313136312e31353739373837303339007fffffff80000000000000008001020003

The sstable parses without any problems and the data looks sane.
Maybe there is just one corrupt row that throws off the parsing.

With my example above, I was able to create a sstable, using range tombstone (with one of the cks set to null) on a table created with compact storage which has a partition written to it with trailing null cks) that cannot be parsed by sstable dump yet can be parsed by us (tested it with major). probably some variant of it is exactly the bug being faced here.

This might also be a result of sstabledump being out-of-date. As it is not based on our own sstable parser, there might be artificial incompatibilities that can result in this. We should really have an sstabledump based on our own source tree.

I have a POC that is already successfully parsing the provided sstable. I sent an RFC to the mailing list: https://groups.google.com/g/scylladb-dev/c/U6QjCItNLd8/m/VBqN8mQxAQAJ

Update: I have moved the discussion with @Flamefork to slack to avoid spamming this issue. To summarize the current state:

  • The scrub solution didn't work, the corruption is not at all limited to just non-null component after null component, there are other parse errors, some don't trigger any errors instead cause the parser to parse data belonging to other partitions as rows belonging to a preceding one.
  • Reproduction attempts failed.
  • I developed an index-reader tool that allows listing the partitions in the affected sstables.
  • Currently I'm working on a validator wired in just below the sstable writer that is supposed to catch the corrupt data and prevent it from getting to the disk.

Another user reported to have seen the non-null component after null component error on the user mailing list: https://groups.google.com/g/scylladb-users/c/EzR8xhVZouc/m/6ip0aJImBAAJ

@tgrabiec / @bhalevy / @denesb - please reference the commits you have done related to this issue - I want to keep track of them via the github issue for backports

@slivne the validating patchset was merged to master as https://github.com/scylladb/scylla/commit/6cfc949e6217ac987dbed6b53c10e399cb5cf61e.

Just to update on the investigation: we concluded that the cause is a missing partition-end fragment, it explains all symptoms. To prevent this from happening in the future we moved the mutation fragment stream validator (which should have caught this) already present in one of the sstable write paths, to the sstable writer itself, so all sstable write paths are covered.

Managed to reproduce this in a mixed cluster (3 nodes with 8 CPU, 3 nodes with 24CPU), after deleting a few sstables and issuing a repair. After starting the repair a compaction on the repair written sstables hits the issue. The good news is that the validator we rolled out catches it.
Curiously, contrary to what I expected, the bug seems to be in the sstable reader, which when the last clustering row of a partition is the last fragment in the buffer, on the next buffer fill decides to move over to the next partition, forgetting to emit a partition end:

Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - mp_row_consumer_m 0x6070076f8ad0: consume_column(id={55}, path=, value=6f74686572, ts=1610258705682050, ttl=0, del_time=9223372036854775807, deleted=0)
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - mp_row_consumer_m 0x6070076f8ad0: consume_complex_column_start({9}, {tombstone: timestamp=1610258705682049, deletion_time=1610258705})
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - mp_row_consumer_m 0x6070076f8ad0: consume_column(id={9}, path=66697273745f73657373696f6e, value=013461bd, ts=1610258705682050, ttl=0, del_time=9223372036854775807, deleted=0)
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - mp_row_consumer_m 0x6070076f8ad0: consume_complex_column_end({9})
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - fill_buffer() - start
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - index 0x60700573a500 bound 0x60700573a558: advance_to({-7992537618814688209, pk{000400001f910015313434353933383035352e31363130313938323836}, w=0}), _previous_summary_idx=1, _current_summary_idx=0
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - index 0x60700573a500: summary_idx=0
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - index 0x60700573a500: advance_to_page(0), bound 0x60700573a558
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - index 0x60700573a500: same page
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - index 0x60700573a500: old page index = 87
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - index 0x60700573a500: new page index = 90, pos=113647
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - index 0x60700573a500: advance_to({position: clustered,ckp{},-1}), current data_file_pos=113647
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - index 0x60700573a500: no promoted index
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - data_consume_rows_context 0x60700083bc00: skip_to(112744 -> 113647, el=0)
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - mp_row_consumer_m 0x6070076f8ad0: reset(0)
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - mp_row_consumer_m 0x6070076f8ad0: consume_partition_start(deltime=(2147483647, -9223372036854775808)), _is_mutation_end=1
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - mp_row_consumer_m 0x6070076f8ad0: setup_for_partition(pk{0004000011ac0015313039383734393238342e31353938343632303534})

Don't know the reason yet.

This is caused by out-of-order partition in the index, which throws off the index reader, causing it to return a later partition when asked to advance to the current one. The out-of-order partitions in turn are written by repair, the exact cause is not yet clear, but we suspect that a semaphore is reordering in-flight data.

The reordering was found to happen in the RPC streaming code, in smp::submit_to(). smp::submit_to() doesn't provide ordering guarantees and it seems that in some cases indeed it might reorder tasks.

We already have an issue for this reordering: https://github.com/scylladb/scylla/issues/7552. So far we thought that this could only happen in debug mode, but this issue proved that this is not the case. Closing this issue as a duplicate of https://github.com/scylladb/scylla/issues/7552, as the problems seen in this issue are just symptoms of the problem whose root cause is #7552.

Was this page helpful?
0 / 5 - 0 ratings