Scylla: sstables::malformed_sstable_exception (non-null component after null component)

Created on 16 Nov 2020 · 42Comments · Source: scylladb/scylla

This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing-list at [email protected] or in our slack channel.

[x] I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.

Installation details
Scylla version (or git commit hash): 4.1.6-0.20200831.6d9ff622df
Cluster size: 7 (temporarily, 4 normally)
OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu 18.04.5 LTS (GNU/Linux 4.15.0-112-generic x86_64)

We're having the following error for different sstables on all nodes of our cluster since yesterday.

Nov 16 16:26:55 scylla-01 scylla:  [shard 12] compaction_manager - compaction failed: sstables::malformed_sstable_exception (non-null component after null component in sstable /var/lib/scylla/data/rick/sessions-c6a9b530abc111eaa092000000000001/mc-86626-big-Data.db): retrying

Maybe worth noting that we have it even on the node that's joining cluster right now.

DESCRIBE TABLE rick.sessions result

CREATE TABLE rick.sessions (
    app_id int,
    client_id text,
    type text,
    ga_sessioncount int,
    ga_date int,
    all_landing_page_path text,
    app_date text,
    attribution_id uuid,
    attribution_version int,
    campaign_type text,
    channel_group text,
    channel_type text,
    client_segment text,
    client_user_ids list<text>,
    cohorts map<text, int>,
    costs_multiplier double,
    ga_adcontent text,
    ga_adgroup text,
    ga_admatchedquery text,
    ga_adplacementdomain text,
    ga_adwordsadgroupid text,
    ga_adwordscampaignid text,
    ga_adwordscreativeid text,
    ga_adwordscustomerid text,
    ga_appversion text,
    ga_browser text,
    ga_browserversion text,
    ga_campaign text,
    ga_channelgrouping text,
    ga_city text,
    ga_country text,
    ga_datasource text,
    ga_datehourminute_max bigint,
    ga_datehourminute_min bigint,
    ga_devicecategory text,
    ga_eventactions set<text>,
    ga_hostname text,
    ga_keyword text,
    ga_landingpagepath text,
    ga_mobiledeviceinfo text,
    ga_networkdomain text,
    ga_operatingsystem text,
    ga_pagepaths set<text>,
    ga_region text,
    ga_screennames set<text>,
    ga_screenresolution text,
    ga_sourcemedium text,
    ga_transactionid_transactionrevenue map<text, decimal>,
    grabber_id uuid,
    initial_landing_page_path text,
    landing_page_path text,
    metric_hits map<text, int>,
    metric_hits_first_click map<text, int>,
    metric_hits_indirect_free map<text, int>,
    metric_hits_post_click map<text, int>,
    real_source_medium text,
    segment text static,
    transactionid_transactionrevenue_first_click map<text, decimal>,
    transactionid_transactionrevenue_indirect_free map<text, decimal>,
    transactionid_transactionrevenue_post_click map<text, decimal>,
    user_id text,
    user_ids set<text> static,
    white_list_landing_page_path text,
    PRIMARY KEY ((app_id, client_id), type, ga_sessioncount, ga_date)
) WITH CLUSTERING ORDER BY (type ASC, ga_sessioncount ASC, ga_date ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'LeveledCompactionStrategy'}
    AND compression = {'chunk_length_in_kb': '4', 'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

sstabledump for the mentioned sstable fails with the following error:

Exception in thread "main" org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /var/lib/scylla/data/rick/sessions-c6a9b530abc111eaa092000000000001/mc-86626-big-Data.db
    at org.apache.cassandra.io.sstable.SSTableIdentityIterator.hasNext(SSTableIdentityIterator.java:134)
    at org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.computeNext(LazilyInitializedUnfilteredRowIterator.java:100)
    at org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.computeNext(LazilyInitializedUnfilteredRowIterator.java:32)
    at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
    at org.apache.cassandra.tools.JsonTransformer.serializePartition(JsonTransformer.java:206)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
    at org.apache.cassandra.tools.JsonTransformer.toJson(JsonTransformer.java:101)
    at org.apache.cassandra.tools.SSTableExport.run(SSTableExport.java:253)
    at com.scylladb.tools.SSTableExport.main(SSTableExport.java:44)
Caused by: java.io.IOException: Corrupt (negative) value length encountered
    at org.apache.cassandra.db.marshal.AbstractType.readValue(AbstractType.java:439)
    at org.apache.cassandra.db.rows.Cell$Serializer.deserialize(Cell.java:255)
    at org.apache.cassandra.db.rows.UnfilteredSerializer.readSimpleColumn(UnfilteredSerializer.java:659)
    at org.apache.cassandra.db.rows.UnfilteredSerializer.lambda$deserializeRowBody$1(UnfilteredSerializer.java:624)
    at org.apache.cassandra.utils.btree.BTree.applyForwards(BTree.java:1221)
    at org.apache.cassandra.utils.btree.BTree.applyForwards(BTree.java:1225)
    at org.apache.cassandra.utils.btree.BTree.apply(BTree.java:1176)
    at org.apache.cassandra.db.Columns.apply(Columns.java:384)
    at org.apache.cassandra.db.rows.UnfilteredSerializer.deserializeRowBody(UnfilteredSerializer.java:620)
    at org.apache.cassandra.db.rows.UnfilteredSerializer.deserializeOne(UnfilteredSerializer.java:482)
    at org.apache.cassandra.db.rows.UnfilteredSerializer.deserialize(UnfilteredSerializer.java:438)
    at org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:95)
    at org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:73)
    at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
    at org.apache.cassandra.io.sstable.SSTableIdentityIterator.hasNext(SSTableIdentityIterator.java:122)
    ... 17 more

Health check data and file sample upload UUID: 442158e9-791c-4a08-a3ec-5247af3505ec

User Request bug

Source

Flamefork

Most helpful comment

I have a POC that is already successfully parsing the provided sstable. I sent an RFC to the mailing list: https://groups.google.com/g/scylladb-dev/c/U6QjCItNLd8/m/VBqN8mQxAQAJ

denesb on 24 Nov 2020

❤1 🎉1

All 42 comments

@Flamefork thanks

I expect we may need the full components for the sstable you shared not only the Data part - can you upload all of mc-86626-* parts

@bhalevy - the upload has the schema and a sample sstable Data part - lets see if that is enough to start working on this.

slivne on 17 Nov 2020

🙏

Uploaded mc-86626-big-others.tar.gz to the same UUID.

Flamefork on 17 Nov 2020

Looking at closed issues that had these errors (https://github.com/scylladb/scylla/issues/4009)

@Flamefork I checked the schema you have some text fields as part of the partition / cluster key - could it be that those are non UTF-8

Thanks for uploading the other parts - we should be able to extract the needed info now

slivne on 17 Nov 2020

I think it's possible, but I haven't seen it in our data yet. From the key ((app_id, client_id), type, ga_sessioncount, ga_date) text fields, type comes from our code, but client_id is a free-form id, that usually comes in one of two common formats, but in theory, it could be anything.

Flamefork on 17 Nov 2020

Just in case, is there anything else we could provide/do to help debugging it?

Flamefork on 17 Nov 2020

FWIW, sstabldump also agrees the sstable is corrupted:

Exception in thread "main" org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /data/7623/442158e9-791c-4a08-a3ec-5247af3505ec/mc-86626-big-Data.db
        at org.apache.cassandra.io.sstable.SSTableIdentityIterator.hasNext(SSTableIdentityIterator.java:134)
        at org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.computeNext(LazilyInitializedUnfilteredRowIterator.java:100)
        at org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.computeNext(LazilyInitializedUnfilteredRowIterator.java:32)
        at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
        at org.apache.cassandra.tools.JsonTransformer.serializePartition(JsonTransformer.java:206)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
        at org.apache.cassandra.tools.JsonTransformer.toJson(JsonTransformer.java:101)
        at org.apache.cassandra.tools.SSTableExport.run(SSTableExport.java:253)
        at com.scylladb.tools.SSTableExport.main(SSTableExport.java:44)
Caused by: java.io.IOException: Corrupt (negative) value length encountered
        at org.apache.cassandra.db.marshal.AbstractType.readValue(AbstractType.java:439)
        at org.apache.cassandra.db.rows.Cell$Serializer.deserialize(Cell.java:255)
        at org.apache.cassandra.db.rows.UnfilteredSerializer.readSimpleColumn(UnfilteredSerializer.java:659)
        at org.apache.cassandra.db.rows.UnfilteredSerializer.lambda$deserializeRowBody$1(UnfilteredSerializer.java:624)
        at org.apache.cassandra.utils.btree.BTree.applyForwards(BTree.java:1221)
        at org.apache.cassandra.utils.btree.BTree.applyForwards(BTree.java:1225)
        at org.apache.cassandra.utils.btree.BTree.apply(BTree.java:1176)
        at org.apache.cassandra.db.Columns.apply(Columns.java:384)
        at org.apache.cassandra.db.rows.UnfilteredSerializer.deserializeRowBody(UnfilteredSerializer.java:620)
        at org.apache.cassandra.db.rows.UnfilteredSerializer.deserializeOne(UnfilteredSerializer.java:482)
        at org.apache.cassandra.db.rows.UnfilteredSerializer.deserialize(UnfilteredSerializer.java:438)
        at org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:95)
        at org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:73)
        at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
        at org.apache.cassandra.io.sstable.SSTableIdentityIterator.hasNext(SSTableIdentityIterator.java:122)
        ... 17 more

bhalevy on 17 Nov 2020

Reproduced on master with the given sstable.

denesb on 17 Nov 2020

(gdb) scylla schema $dereference_smart_ptr(_sst)._schema
(schema*) 0x60000728e000 ks="rick" cf="sessions" id=cacf9d70-28ce-11eb-b5f2-000000000000 version=c7b6ee27-9820-3bd4-80f1-68908756466b

partition key: byte_order_equal=true byte_order_comparable=false is_reversed=false
    "org.apache.cassandra.db.marshal.Int32Type"
    "org.apache.cassandra.db.marshal.UTF8Type"

clustering key: byte_order_equal=true byte_order_comparable=false is_reversed=false
    "org.apache.cassandra.db.marshal.UTF8Type"
    "org.apache.cassandra.db.marshal.Int32Type"
    "org.apache.cassandra.db.marshal.Int32Type"

(gdb) p $dereference_smart_ptr(_column_translation._state).clustering_column_value_fix_lengths
$16 = std::vector of length 3, capacity 3 = {std::optional<unsigned int> [no contained value], std::optional<unsigned int> = {[contained value] = 4}, std::optional<unsigned int> = {[contained value] = 4}}

(gdb) p _ck_blocks_header_offset
$18 = 2 # we are reading the third component

(gdb) p _row_key
$19 = std::vector of length 0, capacity 3 # how can this be empty?

(gdb) p/x _ck_blocks_header
$25 = 0x1e # b00011110 -> (null, null, not-null)

```

denesb on 18 Nov 2020

@Flamefork do you use Thrift at all?

denesb on 18 Nov 2020

@denesb No, we only use native transport (95% scylladb/gocql, 5% python scylla-driver)

Flamefork on 18 Nov 2020

Write path seems unprotected against non-trailing null components, so if one somehow makes it into memory we'll happily write it to sstables. There is only a comment explaining the assumption:
https://github.com/scylladb/scylla/blob/f55b522c1bdc25dbae56c1e1ff6fe1c14c9ecc60/sstables/mx/writer.cc#L196-L200

denesb on 18 Nov 2020

Cc @tgrabiec

bhalevy on 18 Nov 2020

The sstable is corrupt. If I uncomment the exception to let parsing continue, I hit this:

Exception when communicating with 127.0.0.1, to read from rick.sessions: sstables::malformed_sstable_exception (Too big ttl: 45220846477190 in sstable /home/bdenes/Desktop/scylladb/data/rick/sessions-cacf9d7028ce11ebb5f2000000000000/mc-86626-big-Data.db

Data just prior to the bad key seems sane.

denesb on 19 Nov 2020

@Botond Dénes bdenes@scylladb.com please create an issue for the missing
item in the write path.

We have sstable scrub - can it skip the bad partition - so we can extract
everything but this partition ?

Do you have any lead as to what may have caused this ?

Can we skip the bad partition ?

On Thu, Nov 19, 2020 at 11:00 AM Botond Dénes notifications@github.com
wrote:

The sstable is corrupt. If I uncomment the exception to let parsing
continue, I hit this:

Exception when communicating with 127.0.0.1, to read from rick.sessions: sstables::malformed_sstable_exception (Too big ttl: 45220846477190 in sstable /home/bdenes/Desktop/scylladb/data/rick/sessions-cacf9d7028ce11ebb5f2000000000000/mc-86626-big-Data.db

Data just prior to the bad key seems sane.

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/7623#issuecomment-730228710,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AA2OCCCZHMVDTASJMTEGNPLSQTNCBANCNFSM4TXMWEVQ
.

slivne on 19 Nov 2020

Sstable scrub is not currently capable of doing this (which reminds me: we should document what it can currently do). We can however try to teach it to do this. We can add an index-based skipping read to sstable scrub. This would work like a normal read until an exception is hit during parsing. When this happens, scrub would stop and try to jump over the corrupt bits using the index. It would try skipping to the closest position ahead we have an index entry for. When doing this it would log the clustering range it skipped over. It would repeat this until it reaches EOF. This method assumes the index is intact.

denesb on 19 Nov 2020

Do you have any lead as to what may have caused this ?

Not yet.

denesb on 19 Nov 2020

please create an issue for the missing item in the write path.

Done: https://github.com/scylladb/scylla/issues/7650

denesb on 19 Nov 2020

@tgrabiec do you think this index-based skipping is feasible?

denesb on 19 Nov 2020

@denesb to add on this: I've seen few instances of Too big ttl error in our logs too. Would it help if I find and upload such stable too?
UPD: uploaded to UUID c1cbb1aa-8b33-4f77-8285-a7aba9ec7ecd

Flamefork on 19 Nov 2020

If I further comment out the Too big ttl error, I hit a large allocation request of more than 4GB. As the entire sstable is just 15MB, this is very likely a genuine corruption.

denesb on 19 Nov 2020

Fully parsed the it looks like this is a null row:

     6 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - FLAGS_2 _flags=0x0
     7 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - FLAGS_2 !ext flags; CLUSTERING_ROW
     8 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CLUSTERING_ROW
     9 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK
    10 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK_HEADER
    11 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK2
    12 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK2 - null
    13 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK                                                                                                                                                                                             
    14 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK2 
    15 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK2 - null 
    16 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK 
    17 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK2 
    18 ERROR 2020-11-19 10:52:54,902 [shard 0] sstable - Error while parsing clustering-key: non-null component after null component
    19 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK2 - empty
    20 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - CK_BLOCK
    21 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - mp_row_consumer_m 0x6000000aadb0: consume_row_start(ckp{0000})
    22 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - mp_row_consumer_m 0x6000000aadb0: emit
    23 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - mp_row_consumer_m 0x6000000aadb0: consume_row_marker_and_tombstone({row_marker: }, {tombstone: none}, {tombstone: none}), key={position: clustered,ckp{0000},0}
    24 TRACE 2020-11-19 10:52:54,902 [shard 0] sstable - data_consume_row_context 0x600000897000: state=46, size=1098 _i=470

denesb on 19 Nov 2020

@Flamefork was there a point in time after which you started experiencing these errors? If yes, is there an event that you can relate to it (update to new version, or something else)?

denesb on 19 Nov 2020

@denesb hmm, just noticed. Not sure if relevant though.
First appearance in logs @ Nov 14 12:41:14 UTC.
This is less than an hour after a new node was up in a cluster, but error appeared on another, old node (and then started appearing more and more on all other nodes), probably that's why we haven't noticed this fact earlier.

Flamefork on 19 Nov 2020

With compact storage,
CREATE TABLE cf (pk int, ck1 int, ck2 int, ck3 int, rc int, primary key (pk, ck1, ck2, ck3)) WITH COMPACT STORAGE;

I am able to produce a compound key with trailing nulls:

 pk | ck1 | ck2  | ck3  | rc
----+-----+------+------+----
  1 |   1 | null | null |  1

With insert I was unable to create a compound key where a non-null component follows a null one.

However, with range tombstone, created as follow, I am able to create a sstable which cannot be parsed by sstabledump:
DELETE from ks.cf WHERE pk=1 and pk2=1 AND ck=1 AND ck2=null AND ck3 <= 1;

Follow the error when running dump on it:

$ ./tools/bin/sstabledump ~/scylla/data/data/ks/cf-4a7aca702a7a11eb87b0000000000000/md-2-big-Data.db 
Using /home/raphaelsc/scylla/conf/scylla.yaml as the config file
[
  {
    "partition" : {
      "key" : [ "1", "1" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 42,
        "clustering" : [ 1 ] } ] } ]Exception in thread "main" java.lang.NullPointerException
    at org.apache.cassandra.serializers.Int32Serializer.deserialize(Int32Serializer.java:31)
    at org.apache.cassandra.serializers.Int32Serializer.deserialize(Int32Serializer.java:25)
    at org.apache.cassandra.db.marshal.Int32Type.toJSONString(Int32Type.java:101)
    at org.apache.cassandra.tools.JsonTransformer.serializeClustering(JsonTransformer.java:350)
    at org.apache.cassandra.tools.JsonTransformer.serializeRow(JsonTransformer.java:244)
    at org.apache.cassandra.tools.JsonTransformer.serializePartition(JsonTransformer.java:211)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
    at org.apache.cassandra.tools.JsonTransformer.toJson(JsonTransformer.java:101)
    at org.apache.cassandra.tools.SSTableExport.run(SSTableExport.java:253)
    at com.scylladb.tools.SSTableExport.main(SSTableExport.java:44)

Here https://github.com/scylladb/scylla/issues/7650#issuecomment-730404847, @denesb shared the code in writer which is supposed to handle this, but perhaps it's not sufficient for range tombstones

raphaelsc on 19 Nov 2020

Opened https://github.com/scylladb/scylla/issues/7658 for the sstable scrub feature.

denesb on 19 Nov 2020

An interesting development: if I remove this 49 bytes from the data:

001e000400001f910000143935353639313136312e31353739373837303339007fffffff80000000000000008001020003

The sstable parses without any problems and the data looks sane.
Maybe there is just one corrupt row that throws off the parsing.

denesb on 20 Nov 2020

Actually there are some rows with null clustering keys, so maybe this was such a row, and serializing it went sideways.

denesb on 20 Nov 2020

I tried some further mutations of the bad row to see if I can make it parse but failed.

I think that the key to what happened here are the null keys. We've seen these in the past, although those were null partition keys. I still don't know where they are coming from, but we should add a similar validation tot he write path to what we already have for partition keys.

I started looking into making sstablescrub be able to skip over corrupt parts in the file. In the case of the sstable this ability should be enough to be able to salvage the remains of the file.

denesb on 20 Nov 2020

An interesting development: if I remove this 49 bytes from the data:
001e000400001f910000143935353639313136312e31353739373837303339007fffffff80000000000000008001020003
The sstable parses without any problems and the data looks sane.
Maybe there is just one corrupt row that throws off the parsing.

With my example above, I was able to create a sstable, using range tombstone (with one of the cks set to null) on a table created with compact storage which has a partition written to it with trailing null cks) that cannot be parsed by sstable dump yet can be parsed by us (tested it with major). probably some variant of it is exactly the bug being faced here.

raphaelsc on 20 Nov 2020

@Flamefork how much is this issue impeding your ability to use the cluster? We are asking because we want to see how to prioritize the nodetool scrub feature vs. continuing to dig to the bottom of what caused this and adding protections against it happening in the future.

denesb on 23 Nov 2020

@denesb We have one faulty node in the cluster that we cannot fix: replacing it fails cause streaming cannot receive all needed data cause of a few malformed_sstable_exceptions on sending nodes (tried 2 times), and we can't repair it due to a crashloop. Also, all nodes have 100% CPU load due to compaction fail loop with malformed_sstable_exception. RP=3 + QUORUM reads and writes allows our services to work for now, but 1. at a much higher risk of having another node down 2. with much higher latencies. Which actually shows how resilient ScyllaDB under such conditions :) but right now I'm worried that any other minor incident in the cluster would add up to become a real disaster.
Not sure what's the best strategy here, tried to sum up our situation to provide full data for consideration.

Flamefork on 23 Nov 2020

An interesting development: if I remove this 49 bytes from the data:
001e000400001f910000143935353639313136312e31353739373837303339007fffffff80000000000000008001020003
The sstable parses without any problems and the data looks sane.
Maybe there is just one corrupt row that throws off the parsing.
With my example above, I was able to create a sstable, using range tombstone (with one of the cks set to null) on a table created with compact storage which has a partition written to it with trailing null cks) that cannot be parsed by sstable dump yet can be parsed by us (tested it with major). probably some variant of it is exactly the bug being faced here.

This might also be a result of sstabledump being out-of-date. As it is not based on our own sstable parser, there might be artificial incompatibilities that can result in this. We should really have an sstabledump based on our own source tree.

denesb on 23 Nov 2020

I have a POC that is already successfully parsing the provided sstable. I sent an RFC to the mailing list: https://groups.google.com/g/scylladb-dev/c/U6QjCItNLd8/m/VBqN8mQxAQAJ

denesb on 24 Nov 2020

❤1 🎉1

Update: I have moved the discussion with @Flamefork to slack to avoid spamming this issue. To summarize the current state:

The scrub solution didn't work, the corruption is not at all limited to just non-null component after null component, there are other parse errors, some don't trigger any errors instead cause the parser to parse data belonging to other partitions as rows belonging to a preceding one.
Reproduction attempts failed.
I developed an index-reader tool that allows listing the partitions in the affected sstables.
Currently I'm working on a validator wired in just below the sstable writer that is supposed to catch the corrupt data and prevent it from getting to the disk.

denesb on 16 Dec 2020

Another user reported to have seen the non-null component after null component error on the user mailing list: https://groups.google.com/g/scylladb-users/c/EzR8xhVZouc/m/6ip0aJImBAAJ

denesb on 16 Dec 2020

@tgrabiec / @bhalevy / @denesb - please reference the commits you have done related to this issue - I want to keep track of them via the github issue for backports

slivne on 13 Jan 2021

@slivne the validating patchset was merged to master as https://github.com/scylladb/scylla/commit/6cfc949e6217ac987dbed6b53c10e399cb5cf61e.

denesb on 13 Jan 2021

Just to update on the investigation: we concluded that the cause is a missing partition-end fragment, it explains all symptoms. To prevent this from happening in the future we moved the mutation fragment stream validator (which should have caught this) already present in one of the sstable write paths, to the sstable writer itself, so all sstable write paths are covered.

denesb on 13 Jan 2021

Managed to reproduce this in a mixed cluster (3 nodes with 8 CPU, 3 nodes with 24CPU), after deleting a few sstables and issuing a repair. After starting the repair a compaction on the repair written sstables hits the issue. The good news is that the validator we rolled out catches it.
Curiously, contrary to what I expected, the bug seems to be in the sstable reader, which when the last clustering row of a partition is the last fragment in the buffer, on the next buffer fill decides to move over to the next partition, forgetting to emit a partition end:

Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - mp_row_consumer_m 0x6070076f8ad0: consume_column(id={55}, path=, value=6f74686572, ts=1610258705682050, ttl=0, del_time=9223372036854775807, deleted=0)
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - mp_row_consumer_m 0x6070076f8ad0: consume_complex_column_start({9}, {tombstone: timestamp=1610258705682049, deletion_time=1610258705})
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - mp_row_consumer_m 0x6070076f8ad0: consume_column(id={9}, path=66697273745f73657373696f6e, value=013461bd, ts=1610258705682050, ttl=0, del_time=9223372036854775807, deleted=0)
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - mp_row_consumer_m 0x6070076f8ad0: consume_complex_column_end({9})
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - fill_buffer() - start
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - index 0x60700573a500 bound 0x60700573a558: advance_to({-7992537618814688209, pk{000400001f910015313434353933383035352e31363130313938323836}, w=0}), _previous_summary_idx=1, _current_summary_idx=0
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - index 0x60700573a500: summary_idx=0
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - index 0x60700573a500: advance_to_page(0), bound 0x60700573a558
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - index 0x60700573a500: same page
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - index 0x60700573a500: old page index = 87
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - index 0x60700573a500: new page index = 90, pos=113647
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - index 0x60700573a500: advance_to({position: clustered,ckp{},-1}), current data_file_pos=113647
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - index 0x60700573a500: no promoted index
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - data_consume_rows_context 0x60700083bc00: skip_to(112744 -> 113647, el=0)
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - mp_row_consumer_m 0x6070076f8ad0: reset(0)
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - mp_row_consumer_m 0x6070076f8ad0: consume_partition_start(deltime=(2147483647, -9223372036854775808)), _is_mutation_end=1
Feb 04 14:24:36 ip-172-31-46-136.eu-west-1.compute.internal scylla[11507]:  [shard 7] sstable - mp_row_consumer_m 0x6070076f8ad0: setup_for_partition(pk{0004000011ac0015313039383734393238342e31353938343632303534})

Don't know the reason yet.

denesb on 4 Feb 2021

This is caused by out-of-order partition in the index, which throws off the index reader, causing it to return a later partition when asked to advance to the current one. The out-of-order partitions in turn are written by repair, the exact cause is not yet clear, but we suspect that a semaphore is reordering in-flight data.

denesb on 5 Feb 2021

The reordering was found to happen in the RPC streaming code, in smp::submit_to(). smp::submit_to() doesn't provide ordering guarantees and it seems that in some cases indeed it might reorder tasks.

denesb on 15 Feb 2021

We already have an issue for this reordering: https://github.com/scylladb/scylla/issues/7552. So far we thought that this could only happen in debug mode, but this issue proved that this is not the case. Closing this issue as a duplicate of https://github.com/scylladb/scylla/issues/7552, as the problems seen in this issue are just symptoms of the problem whose root cause is #7552.

denesb on 16 Feb 2021

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Provide Scylla Helm Charts to support Kubernetes installations

eyalgutkind · 3Comments

Bad performance on default CentOS/RHEL 7.0 kernel

avikivity · 4Comments

Release process: create AMI in more regions

tzach · 7Comments

Scylla_setup: still ask for nic in non-interactive mode

tarzanek · 6Comments

Support Datetime functions

tzach · 3Comments