Scylla: Support for SSTable "md" format (CASSANDRA-14861)

Created on 17 Apr 2019 · 11 Comments · Source: scylladb/scylla

In a discussion on the Cassandra user mailing list, a user noticed that his SSTables were of version "md":

I have noticed something since I upgraded to Cassandra 3.0.18.

Before, all my SSTables used to be named this way:

```
mc-130817-big-CompressionInfo.db
mc-130817-big-Data.db
mc-130817-big-Digest.crc32
mc-130817-big-Filter.db
mc-130817-big-Index.db
mc-130817-big-Statistics.db
mc-130817-big-Summary.db
mc-130817-big-TOC.txt
```

Since the update I have a new type of file:

```
md-20631-big-Statistics.db
md-20631-big-Filter.db
md-20631-big-TOC.txt
md-20631-big-Summary.db
md-20631-big-CompressionInfo.db
md-20631-big-Data.db
md-20631-big-Digest.crc32
md-20631-big-Index.db
```

He was told that this is related to CASSANDRA-14861.

sstable min/max metadata can cause data loss

            Key: CASSANDRA-14861
            URL: https://issues.apache.org/jira/browse/CASSANDRA-14861
        Project: Cassandra
     Issue Type: Bug
       Reporter: Blake Eggleston
       Assignee: Blake Eggleston
       Priority: Major
        Fix For: 3.0.18, 3.11.4, 4.0

There’s a bug in the way we filter sstables in the read path that can cause sstables containing relevant range tombstones to be excluded from reads. This can cause data resurrection for an individual read, and if compaction timing is right, permanent resurrection via read repair.
We track the min and max clustering values when writing an sstable so we can avoid reading from sstables that don’t contain the clustering values we’re looking for in a given read. The min and max for each clustering column are updated for each row / RT marker we write. In the case of range tombstone markers though, we only update the min and max for the clustering values they contain, which is almost never the full set of clustering values. This leaves a min/max that is above/below (respectively) the real range covered by the range tombstone contained in the sstable.
For instance, assume we’re writing an sstable for a table with 3 clustering values. The current min clustering is 5:6:7. We write an RT marker for a range tombstone that deletes any row with the value 4 in the first clustering value, so the open marker is [4:]. This would make the new min clustering 4:6:7 when it should really be 4:. If we do a read for clustering values of 4:5 and lower, we’ll exclude this sstable and its range tombstone, resurrecting any data that this tombstone would have deleted.
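
The worked example in the description above can be reproduced with a small model. This is only an illustrative sketch, not Cassandra's or Scylla's actual code: clustering keys and range tombstone bounds are modeled as Python tuples of integers, and the function names `buggy_min`, `prefix_aware_min`, and `can_skip` are hypothetical. It assumes that a start bound such as `[4:` sorts before every full key that begins with 4, which Python's tuple ordering happens to match.

```python
from typing import Tuple

# A full clustering key (5, 6, 7) or a shorter range tombstone bound such as (4,).
Clustering = Tuple[int, ...]


def buggy_min(current: Clustering, written: Clustering) -> Clustering:
    """Per-column minimum: only the columns present in `written` are updated."""
    merged = list(current)
    for i, value in enumerate(written):
        merged[i] = min(merged[i], value)
    return tuple(merged)


def prefix_aware_min(current: Clustering, written: Clustering) -> Clustering:
    """Compare whole keys; a shorter prefix sorts before the keys it prefixes."""
    return min(current, written)


def can_skip(sstable_min: Clustering, read_upper_bound: Clustering) -> bool:
    """An sstable is skipped when its smallest clustering lies above the read range."""
    return sstable_min > read_upper_bound


current_min = (5, 6, 7)   # min clustering before the RT marker is written
rt_open_marker = (4,)     # open marker [4: of the range tombstone
read_upper = (4, 5)       # read for clustering values of 4:5 and lower

wrong = buggy_min(current_min, rt_open_marker)         # (4, 6, 7)
right = prefix_aware_min(current_min, rt_open_marker)  # (4,)

print(can_skip(wrong, read_upper))  # True: sstable and its tombstone are excluded
print(can_skip(right, read_upper))  # False: sstable is read, tombstone is applied
```

With the per-column update the recorded minimum is 4:6:7, which lies above the read's upper bound of 4:5, so the sstable is wrongly skipped. Recording the bound itself as the minimum keeps the sstable in the read.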

Fix was committed to Apache Cassandra in October 2018.

I would presume (but it would have to be tested) that this same defect occurs in Scylla. It would behoove us to upgrade our SSTables to support this version, both to ease migration from the latest versions of Cassandra and to ensure that the root defect does not impact our users.

Eng-2 enhancement

All 11 comments

I'll look into it to see if we're exposed to the same issue and to understand if there's any on-disk format change in md vs. mc.

Cc @tgrabiec @haaawk @slivne

We're not exposed.

We detected this problem and disabled this optimization in https://github.com/scylladb/scylla/issues/3553.

Re-enabling this optimization is part of https://github.com/scylladb/scylla/issues/4042

@bhalevy I don't think this is urgent. Do we have users that are hitting this?

A workaround for now is to rename the files from md to mc and load the files - this should work.
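
The rename workaround could be scripted roughly as follows. This is a minimal sketch under stated assumptions: the function name `rename_md_to_mc` and its `dry_run` parameter are hypothetical, the files are assumed to follow the `md-<generation>-big-<Component>` naming shown in the issue description, and the script does nothing beyond renaming. Whether the renamed files load correctly is not verified here, so it is safer to run it on a copy of the data.

```python
from pathlib import Path
from typing import List, Tuple


def rename_md_to_mc(directory: Path, dry_run: bool = True) -> List[Tuple[Path, Path]]:
    """Plan (and optionally perform) renames of md-* sstable components to mc-*.

    Returns the list of (source, target) pairs. Nothing is renamed when
    dry_run is True, so the plan can be inspected first.
    """
    plan = []
    for source in sorted(directory.glob("md-*-big-*")):
        target = source.with_name("mc-" + source.name[len("md-"):])
        if target.exists():
            # Refuse to overwrite an existing mc sstable with the same generation.
            raise FileExistsError(f"{target} already exists")
        plan.append((source, target))

    if not dry_run:
        for source, target in plan:
            source.rename(target)
    return plan


if __name__ == "__main__":
    # Hypothetical path; point this at a copy of the table directory.
    for src, dst in rename_md_to_mc(Path("./table-copy"), dry_run=True):
        print(f"{src.name} -> {dst.name}")
```

All conflicts are checked before any file is touched, so a clash with an existing `mc` generation aborts the run without leaving a half-renamed sstable.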

@slivne per your question, we did have a single user who hit it

@penberg are you working on this?

@slivne I took @penberg's patch and am taking it from there.
There's more to it (like fixing the way we generate min/max clustering keys metadata to be compatible with the md format).

@bhalevy if there are bugs behind "fixing the way we generate min/max clustering keys metadata to be compatible with the md format", please open issues for them and reference them.

I think we can just put it under this issue since it's basically incompatibility with Cassandra's md format.

New feature, not backporting.
