Note: I haven't confirmed this behavior in practice; it's currently theoretical, based on reading the code. I'm holding off on a severity label until it's confirmed. The fix will need to be backported to 2.0.
When computing which timestamps are present in an sstable for the time bound iterator performance optimization, all keys without a timestamp are ignored, which includes intents. This means that certain (probably unlikely) compactions will result in a situation where a time bounded iterator reading from [t1,t2] will miss an intent at t2.
Backup uses timebounded iterators and does a read at a certain timestamp until all intents below that timestamp have been resolved one way or another. This bug means that backup could incorrectly proceed with an unresolved intent. If this intent is later rolled back, the backup would contain a value that was never committed.
CDC will have similar correctness issues.
Several fixes are possible with various performance tradeoffs. A partial list:
Backup uses timebounded iterators and does a read at a certain timestamp until all intents below that timestamp have been resolved one way or another. This bug means that backup could incorrectly proceed with an unresolved intent. If this intent is later rolled back, the backup would contain a value that was never committed.
I don't think this exact scenario is possible. If the time bound iterator accidentally excludes a lingering intent from an aborted transaction, it's no problem, as the backup wasn't supposed to contain that intent anyway. The problem, as far as I can see, arises when the iterator excludes a lingering intent from a committed transaction; that intent needed to be promoted to a write, but the backup will not include that write.
When an intent is laid down, we write two keys: the MVCCMetadata key at ts=0 that indicates that an intent exists and the provisional value at ts=value_ts that holds the intent's value. When we commit an intent, we actually just delete the MVCCMetadata key, effectively "revealing" the provisional value as committed.
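To make that concrete, here's a rough Go sketch of the two writes; engine.MVCCKey is the real key type, but the batch variable, the value placeholders, and the simplified flow are illustrative rather than the actual MVCCPut code path:

// Laying down an intent on key k at timestamp valueTS writes two entries.
metaKey := engine.MVCCKey{Key: k}                     // no timestamp => the ts=0 MVCCMetadata key
provKey := engine.MVCCKey{Key: k, Timestamp: valueTS} // the provisional value's versioned key

_ = batch.Put(metaKey, encodedMeta)  // encodedMeta: serialized MVCCMetadata marking "intent here"
_ = batch.Put(provKey, encodedValue) // encodedValue: the provisional value itself

// Committing the intent later just deletes the metadata key, "revealing" the
// provisional value as committed.
_ = batch.Clear(metaKey)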
I think the issue here is that if an intent's MVCCMetadata key ends up in a different sst from its provisional value key, it's possible that a timebound iterator scans only over the sst with the provisional value key. In that case, the value will look committed even though it's not! I think the hazard arises when the intent is either aborted or pushed before being committed. If this happens, the backup would contain a value that was never committed or was committed at a later timestamp.
Gah! My understanding was still subtly wrong. Thanks for the clarification, @nvanbenschoten.
There are now two other solutions, though I'm not sure how well RocksDB supports them. One is to simply disallow splitting a metadata key from its actual values. The other is to only look at the last key in an SST. If it's a metadata key, then we need to look in and see if there's an intent. Only in that case do we need to erase the timestamp bounds on the SST.
Also, no wonder we haven't seen this in practice.
I'm upgrading this to S-1 to hopefully get more visibility. I think this needs to get fixed for v2.1. We haven't seen this in practice, but that's the nature of this bug: you would never notice when it happened.
@petermattis putting this on your radar. Is there anyone who can take a look at this?
One is to simply disallow splitting a metadata key from its actual values.
This isn't possible as far as I'm aware.
The other is to only look at the last key in an SST. If it's a metadata key, then we need to look in and see if there's an intent.
Are you thinking this would be done at compaction time?
Is there anyone who can take a look at this?
Hrmm, the number of folks familiar with the C++ code is too small.
Are you thinking this would be done at compaction time?
Yeah, at SST build time. I'm hoping there's some way to do this with the table properties collector interface.
Yeah, every time you see an intent, copy the serialized proto into a field in the table prop collector. Every time you see a non-intent, clear it. Then, when the finalize method is called, if the field is non-empty, deserialize the proto and process the timestamp.
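Roughly, in Go-flavored pseudocode (the real collector, TimeBoundTblPropCollector, lives in C++ and implements RocksDB's table properties collector interface; the type, helpers, and field names below are all illustrative):

type timeBoundCollector struct {
    lastIntentMeta []byte // serialized MVCCMetadata of the trailing intent, if any
    minTS, maxTS   int64  // running timestamp bounds for the SST being built
}

// Called once per key/value pair as the SST is written.
func (c *timeBoundCollector) AddUserKey(key, value []byte) {
    if isIntentMetaKey(key) { // a ts=0 MVCCMetadata key
        c.lastIntentMeta = append(c.lastIntentMeta[:0], value...) // copy the serialized proto
    } else {
        c.lastIntentMeta = c.lastIntentMeta[:0] // any later non-intent key clears it
        c.updateBounds(timestampFromKey(key))
    }
}

// Called when the SST is finalized.
func (c *timeBoundCollector) Finish() {
    if len(c.lastIntentMeta) > 0 {
        var meta enginepb.MVCCMetadata
        if err := meta.Unmarshal(c.lastIntentMeta); err == nil {
            c.updateBounds(intentWallTime(&meta)) // fold the intent's timestamp into the bounds
        }
    }
}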
I can take a crack at doing this, but may require some assistance in setting up a test scenario which fails. Sounds like what we want is a situation where there are two sstables where the first contains the MVCC metadata key and the second contains the provisional key. I'm thinking I can create that scenario using ingested sstables. Then I perform a backup and the sstable which contains the provisional key should be included in the backup incorrectly, right?
I can take a crack at doing this
:tada:
Sounds like what we want is a situation where there are two sstables where the first contains the MVCC metadata key and the second contains the provisional key. I'm thinking I can create that scenario using ingested sstables. Then I perform a backup and the sstable which contains the provisional key should be included in the backup incorrectly, right?
Yes, that's my understanding. The SST with the metadata key (SSTm) will need to have time bounds that are older than the SST with the provisional key. Then the backup needs to be incremental on top of a full backup that occurred _after_ the timestamp of any key in SSTm. I think. Getting such a full backup is going to be tricky. I think it'll be easier if you just create an MVCCIncrementalIterator over a store with the desired SSTs and make sure that you don't see the provisional key.
Backups never include provisional keys; they retry until the intent is resolved one way or the other. I think the high-level issue here is that the timebound iterator would incorrectly skip the sstable, and the backup wouldn't know it had to wait for it to be resolved.
Oh, I misunderstood what you meant by provisional key. What @benesch said.
Yes, that's my understanding. The SST with the metadata key (SSTm) will need to have time bounds that are older than the SST with the provisional key. Then the backup needs to be incremental on top of a full backup that occurred after the timestamp of any key in SSTm. I think. Getting such a full backup is going to be tricky. I think it'll be easier if you just create an MVCCIncrementalIterator over a store with the desired SSTs and make sure that you don't see the provisional key.
So perhaps I should do this fully within storage/engine? Can one of you point me to the backup code which uses MVCCIncrementalIterator? Is this the code in ccl/storageccl/export.go?
Yeah, I think it's fine (and much simpler) to test at the storage/engine level. You found it: https://github.com/cockroachdb/cockroach/blob/62c976dafc7f48f0a52142310a8fdc42908c30a5/pkg/ccl/storageccl/export.go#L164-L200
@petermattis and I were just discussing a variant of this that could explain https://github.com/cockroachdb/cockroach/issues/31823. What we're seeing there is an intent that appears to come back to life when using a time-bound iterator. This is causing a transaction to retry indefinitely and stall all other intents because the txn's Scan does not see the intent but its RefreshRange does.
The theory is that an mvcc metadata key's RocksDB deletion tombstone gets put in a separate sst than its original key. The original key is now properly accounted for in the sst min timestamp, but the deletion tombstone is not because it is also written at timestamp 0 (and doesn't have a value we can peek at!). It's possible that nothing else in the deletion tombstone's sst pushes it above the time-bound iterator's lower bound timestamp, which means that the time-bound iterator used in RefreshRange observes the original mvcc metadata key, but not the RocksDB tombstone that deletes it.
It seems reasonable that this is now causing issues like we see on adriatic because of #31290, which went in around the time that we started seeing stalls. That change is completely valid, but it may be exacerbating this other issue and making it much more likely.
That analysis makes sense, but how does #31290 make this occurrence more likely?
Even though our time-bound iterators specify both a min and a max timestamp, in practice it is the minimum bound that provides all the benefit. I think if we changed the iterators to only specify a lower bound, and then included now() in the sstable at the time of writing, we'd be able to preserve a large portion of the performance benefit here. (I don't see a good way to get a more precise bound given the limitations of the rocksdb APIs.)
In a compaction, can we access the properties of the input sstables, or are the properties constructed solely from the KV pairs? If we could get the input properties, then we'd only need to put now() into the L0 sstables and then take the max of the inputs in each subsequent compaction.
@danhhz Can you put some eyes on this and think through how this affects incremental restores?
Chatting with @spencerkimball about this and his thinking is similar to @bdarnell's: we configure the timebound iterator from [txn-start, max). I'm not sure this is sufficient. The deletion tombstone for the intent won't have any timestamp information that can affect the timestamp bounds for the SST. It seems possible for that deletion tombstone to end up surrounded by other keys that have timestamps older than txn-start if those keys are written by different transactions.
In a compaction, can we access the properties of the input sstables, or are the properties constructed solely from the KV pairs? If we could get the input properties, then we'd only need to put now() into the L0 sstables and then take the max of the inputs in each subsequent compaction.
We don't currently get access to the properties of the input sstables. The properties are solely constructed from the KV pairs. I think that could be changed, though there are downsides as then the timestamp bounds are not tight.
We could also adjust table_filter to process range deletions even if the SST is skipped. I don't think that would require any changes either to how we record SST timestamps or determine whether an SST should be skipped. It might be irritating to implement, though.
I'll note that this is more evidence in favor of having a separate storage location (i.e., not SSTs) for range deletions.
@benesch The deletion tombstones being discussed are point deletions. For once, range deletions are not involved.
but how does #31290 make this occurrence more likely
We discussed this in person - my working theory is that this issue is only easily reproducible under high contention, when transactions may take a lot of time between writing intents and committing them. This would allow lots of writes at old timestamps and increase the chance for the mvcc metadata's deletion tombstone to be the operation that should be at the highest timestamp in an sst. I then think #31290 makes this more likely to occur because before that change, it would be likely that both the mvcc metadata write key and the deletion tombstone would be missed by a timebound iterator if either hit this case (essentially canceling each other out). Now, only the deletion tombstone is possible to miss, which is likely the cause of the adriatic issues.
Chatting with @spencerkimball about this and his thinking is similar to @bdarnell's: we configure the timebound iterator from [txn-start, max). I'm not sure this is sufficient.
Right, that could cause the same issue to occur in reverse. If we performed a historical query that should see a deletion tombstone for an intent, we're not guaranteed to if we ignore certain ssts.
@benesch The deletion tombstones being discussed are point deletions. For once, range deletions are not involved.
Ha! Thanks.
It seems like we could perhaps use the RocksDB merge operator here to tag a deletion of a metadata key with a timestamp. That is, rather than calling RocksDB->Delete(metadata_key), we'd call RocksDB->Merge(metadata_key, {TYPE_DELETE, ts, real_metadata}). Then the table properties collector would be able to discover the relevant timestamp for the deletion tombstone, because it would be in the value, and the merge operator would do the work of suppressing the key. That would be a backwards compatibility nightmare, though.
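As a sketch of what the write-path change might look like (Go-flavored; Merge exists on the engine batch today, but the payload encoding, the TYPE_DELETE tag, and their interpretation by a new merge operator are hypothetical):

// Today, resolving a committed intent deletes the metadata key outright:
_ = batch.Clear(metaKey) // plain tombstone: no timestamp for the table props collector to see

// Under this proposal it would instead be:
payload := encodeDeleteOperand(TYPE_DELETE, intentTS, realMetadata) // hypothetical encoding
_ = batch.Merge(metaKey, payload)
// The table properties collector now sees intentTS in the value and can account for it in
// the SST's time bounds; the merge operator is responsible for making the merged key behave
// like a deletion during reads and compactions.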
Right, that could cause the same issue to occur in reverse. If we performed a historical query that should see a deletion tombstone for an intent, we're not guaranteed to if we ignore certain ssts.
My general takeaway is that ignoring certain SSTs is super tricky.
For the RefreshRange and ResolveIntent usage of timebound iterator, I think we want semantics where if we select an SST for the filter, we also automatically select overlapping SSTs that are at higher levels of the LSM. I think this is the moral equivalent of what the [txn-start, max) proposal is trying to do.
It seems like we could perhaps use the RocksDB merge operator here to tag a deletion of a metadata key with a timestamp. That is, rather than calling RocksDB->Delete(metadata_key), we'd call RocksDB->Merge(metadata_key, {TYPE_DELETE, ts, real_metadata}). Then the table properties collector would be able to discover the relevant timestamp for the deletion tombstone, and the merge operator would do the work of suppressing the key. That's going to be a backwards compatibility nightmare, though.
a) Supporting multiple merge operators is a pain as RocksDB doesn't give us an easy mechanism for doing this. b) Are you proposing extending merge operators so they can perform deletions? I don't think merge operators support something like that right now.
A similar idea would be to extend the RocksDB Delete operation to accept a value. This would likely require the addition of a new operation type. This also seems like a compatibility nightmare.
a) Supporting multiple merge operators is a pain as RocksDB doesn't give us an easy mechanism for doing this. b) Are you proposing extending merge operators so they can perform deletions? I don't think merge operators support something like that right now.
a) We could conditionally change the behavior of the merge operator based on the type of the value. The existing merge operator bails out if the value isn't tagged with roachpb::TIMESERIES. We could simply adjust that branch to instead perform the new behavior.
b) Oh, I just assumed that they did support deletions. Shoot.
Chatting with @spencerkimball about this and his thinking is similar to @bdarnell's: we configure the timebound iterator from [txn-start, max). I'm not sure this is sufficient.
Right, that could cause the same issue to occur in reverse. If we performed a historical query that should see a deletion tombstone for an intent, we're not guaranteed to if we ignore certain ssts.
I don't see how historical reads would be a problem with my proposal (which was to always use maxint for the upper bound of the iterator, and to give each sstable's max-timestamp property the current time). Historical reads would be unlikely to be able to exclude many sstables, but the ones they exclude would be old enough to be safe.
For the RefreshRange and ResolveIntent usage of timebound iterator, I think we want semantics where if we select an SST for the filter, we also automatically select overlapping SSTs that are at higher levels of the LSM. I think this is the moral equivalent of what the [txn-start, max) proposal is trying to do.
I like this. I think my ideal solution would be to introduce a delete-with-payload operation that would let us pass the timestamp along, but the compatibility issues there seem more complicated. Expanding the set of selected sstables to include any higher level ones that overlap with one already selected should get us the correct results without changing anything on disk. Can we do this with the available interfaces?
Expanding the set of selected sstables to include any higher level ones that overlap with one already selected should get us the correct results without changing anything on disk. Can we do this with the available interfaces?
My guess is no, though I haven't investigated thoroughly. My intuition is that this wouldn't be hard to do. Certainly much easier than trying to introduce a delete-with-payload operation.
If we can rely on tables being accessed in a predictable order by level, we could do this fairly easily by adding some local state around the table_filter closure. It looks like this is currently close to true, but not quite: Version::AddIterators proceeds from L0 to Ln in order, but within a level the individual SSTs are opened lazily, so we won't necessarily know that there's an overlapping SST in L(n+1) when opening an SST in Ln.
We can query the sstable metadata with GetSSTables and produce a pre-computed set for our table_filter to consult. This seems like it might be expensive, but maybe it would be OK for the few uses of time-bound iterators and worthwhile in comparison to the number of sstable accesses saved.
Hrmm, good observation about the lazy loading of SSTs.
We can query the sstable metadata with GetSSTables and produce a pre-computed set for our table_filter to consult. This seems like it might be expensive, but maybe it would be OK for the few uses of time-bound iterators and worthwhile in comparison to the number of sstable accesses saved.
DB::GetLiveFilesMetaData (which powers GetSSTables) doesn't return the table properties which contains the timestamp bounds.
I wonder if we can estimate the timestamp for a deletion tombstone by using the timestamps for other keys in the SST being generated. Specifically, if a tombstone has a sequence number less than other entries in the same SST, we can assume its timestamp is also less than those entries. If the tombstone has a sequence number greater than other entries in the SST we clear the timestamp bounds for the SST which will force the SST to be used for time-bound queries.
I think this breaks because sequence number order does not imply timestamp order. That is, a transaction can write an entry at an older timestamp than a previous write to a different key by a different transaction. The closed-timestamp stuff implies a limit on how far in the past these historical writes can take place, right?
Right, sequence numbers don't imply timestamp order. The closed-timestamp mechanism lets us dynamically discover a lower bound on how far back things can be changed, but it's not a simple constant offset from the current time.
I don't think we need to worry about sequence numbers, though. The issue here is that the deletion of an MVCCMetadata key (which corresponds to an intent resolution) needs to be tagged with the timestamp of the value it is associated with, which must have been written in a different SST (or else it would be included in the bounds and we wouldn't have to do anything here).
I think if we give each SST that includes a deletion tombstone for a metadata key (in practice this will probably mean all SSTs except those in the bottommost level) a max-timestamp that is at least as high as any previously-written SST, that might be enough.
This might lead to some mid-level SSTs having a max timestamp that is too high, leading us to open too many files on reads. If this is a problem, I think it's possible to be clever about levels to tighten the bounds (only ratchet the timestamp based on ssts written to the same or lower levels), although I don't think that data is currently exposed to the property collector.
We still need to catch an instance of this bug to analyze. I'll keep an eye on the adriatic metrics and stop the world if I can catch it during a stall (anyone else with access, feel free to do the same).
I think if we give each SST that includes a deletion tombstone for a metadata key (in practice this will probably mean all SSTs except those in the bottommost level) a max-timestamp that is at least as high as any previously-written SST, that might be enough.
The unfortunate part of this approach is that a compaction at L6 will give the SST a new max-timestamp. We'd really like the max-timestamp to be derived from the input SSTs. I suppose we can hack something together so that TablePropertiesCollector can perform some initialization based on the input table properties.
The unfortunate part of this approach is that a compaction at L6 will give the SST a new max-timestamp.
That's why I specified "each SST that includes a deletion tombstone". L6 sstables will never contain tombstones, so their timestamp bounds are accurately determined by the keys they contain. This is only a problem for the intermediate levels.
L6 sstables will never contain tombstones, so their timestamp bounds are accurately determined by the keys they contain. This is only a problem for the intermediate levels.
L6 sstables can contain tombstones if a snapshot is active when the sstable is created. Perhaps that is rare enough not to worry about.
Summarizing to get the discussion started again. Timebound iterators can incorrectly skip sstables containing intent tombstones because those tombstones do not contain any usable timestamp information to populate the sstable timestamp bounds. This can lead to a deleted intent incorrectly reappearing for operations which use timebound iterators: EndTransaction, RefreshRange and ResolveIntent. We also use timebound iterators for incremental backup and range feeds. So this is pretty bad.
The options so far:
RefreshRange has used timebound iterators since the introduction of that operation in https://github.com/cockroachdb/cockroach/pull/21140/commits. Does anyone recall how important these uses of timebound iterators are?I'm not thrilled with any of these options right now.
Cc @bdarnell, @benesch, @danhhz, @nvanbenschoten
Another option:
I've got a workaround that is imperfect but addresses the major cases and should be easy to deploy:
Whenever a time-bound iterator returns a potentially-anomalous result, retry with a non-time-bound iterator. I've focused on RefreshRange; I haven't analyzed the other uses of TBIs but RefreshRange is the one responsible for the adriatic stalls. The vast majority of anomalous results are from intents that were resolved as committed, so whenever RefreshRange encounters an intent, it looks up that key with a regular iterator. If no intent is found, the result from the TBI was a false positive and we can proceed. The cost of this in terms of extra disk i/o is small since we're only doing point operations on the regular iterator, so the prefix bloom filter will be in effect.
This leaves cases where an intent was resolved as aborted. Most of the time the deletion of the MVCCMetadata and the deletion of the intent value will end up in the same sstable, so the bounds will be set correctly; the only problem is when the sst split point falls just in between those two keys. When this happens, RefreshRange will effectively perform a dirty read and see the uncommitted intent as if it were committed. This is not as bad as it sounds because it won't return the dirty data, but it will still lead to the same stall loop. We could either accept that this is low probability and leave it for another day or we could do the same trick to retry with a regular iterator for every non-intent value encountered by RefreshRange.
I propose implementing this workaround for RefreshRange (only in the case where it encounters an intent, I think) as a quick fix for 2.1.1. That should stop the adriatic/tpcc stalls and buy us time to consider a longer-term solution. (Since changing the table filter doesn't seem workable at this point I'm currently favoring the delete-with-payload option. My current analysis of rocksdb suggests that there's space in the sst format for a payload, it's just hardcoded to an empty string. There will be a lot of plumbing work to expose this but I think it'll have the forward and backward compatibility we need.)
Does anyone recall how important these uses of timebound iterators are?
IIRC they're a huge win for incremental backup.
For intent resolution, I was originally an advocate for their use, but this was based on a gut feeling and now that we have reason to doubt the safety of TBIs I think it might be reasonable to turn them off here. Most transactions would not be affected; only those that do DeleteRange or write to enough keys that their intents get collapsed.
For RefreshRange, we've never measured with and without TBIs, but I think this one would show up in TPCC numbers. It affects most transactions that experience any contention.
Since changing the table filter doesn't seem workable at this point I'm currently favoring the delete-with-payload option. My current analysis of rocksdb suggests that there's space in the sst format for a payload, it's just hardcoded to an empty string. There will be a lot of plumbing work to expose this but I think it'll have the forward and backward compatibility we need.
The sstable has room for a payload on delete operations, but the batch format which is written to the WAL does not. See the comment in storage/engine/batch.go. Deletions are encoded in the batch as kTypeDeletion varstring (i.e. only the key), while put operations are encoded as kTypeValue varstring varstring (key, value).
Similar to the previous idea, but without changing RocksDB, try to pass timestamp info for the intent in the intent tombstone. The only place to pass info is in the key itself. That sounds like a non-starter, but we control the key comparison routine. Is it possible to add a bit of hidden "suffix" to keys which doesn't affect equality? If we were starting from scratch this would be feasible. Not sure about retrofitting it. This approach would likely prohibit rolling back of an upgrade.
I took a look at this and think it is feasible. MVCC keys are encoded as <key>\x00[<wall_time>[<logical>]]<#timestamp-bytes>. Currently, <#timestamp-bytes> is always one of 0, 8, or 12. If we're willing to bump COCKROACH_VERSION (which would prevent downgrades), we can steal another value from here to indicate a timestamp is present, but that this is an MVCC metadata key. For example, let's say we use the high bit to indicate we have a metadata key and the low 7 bits of <#timestamp-bytes> as the length of the timestamp data. During comparison (DBComparator::{Compare,Equal}) we'd have to normalize the intent timestamp to empty if that high bit is set. During compaction (TimeBoundTblPropCollector::AddUserKey) we'd extract the timestamp from the intent tombstone. This is tricky, but I don't see why it wouldn't work and it would be entirely contained in Cockroach code. One bit of complexity is auditing the code to find all of the places we decode raw MVCC keys. Hopefully, this is all done through the code in libroach and enginepb.DecodeKey.
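A Go sketch of the encoding side of that idea, following the byte layout described above; the flag constant and helper functions are made up for illustration:

const mvccMetaKeyFlag = 0x80 // hypothetical high bit of <#timestamp-bytes> marking a metadata key

// encodeIntentMetaKey encodes a metadata key that smuggles the intent's timestamp along
// for the table properties collector, while the comparator pretends the timestamp is absent.
func encodeIntentMetaKey(key []byte, wallTime uint64, logical uint32) []byte {
    buf := append([]byte(nil), key...)
    buf = append(buf, 0x00)                     // sentinel separating key from timestamp
    buf = appendBigEndian64(buf, wallTime)      // 8 bytes of wall time
    buf = appendBigEndian32(buf, logical)       // 4 bytes of logical
    buf = append(buf, byte(12)|mvccMetaKeyFlag) // length byte: 12 timestamp bytes + metadata flag
    return buf
}

// DBComparator::{Compare,Equal} would normalize away the timestamp whenever the final byte
// has mvccMetaKeyFlag set, and TimeBoundTblPropCollector::AddUserKey would read it instead.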
Note that bumping COCKROACH_VERSION is likely required for the delete-with-value or merge-delete approaches. We could try to soften this by only bumping the version when the cluster version has been upgraded, though that would prevent a backport of any of these approaches to 2.1.
What does rocksdb do with keys that are not bit-identical but are equivalent according to the comparator? If an intent is repeatedly added and removed you could have SST files with two metadata tombstones at different timestamps getting merged into one lower-level SST. Would it recognize that only one (and only the newer one) should appear in the new SST?
Note that bumping COCKROACH_VERSION is likely required for the delete-with-value or merge-delete approaches. We could try to soften this by only bumping the version when the cluster version has been upgraded, though that would prevent a backport of any of these approaches to 2.1.
Are you saying that bumping COCKROACHDB_VERSION would be acceptable in 2.1.x but adding a cluster version would not? I don't agree - the reason we don't introduce new cluster versions in patch releases is because we want to guarantee the ability to downgrade to earlier patch releases. In this respect introducing a new cluster version is the safer option - the preserve_downgrade_option cluster setting gives users the ability to downgrade before finalizing the upgrade. The only real downside to introducing cluster versions in patch releases is that we've currently taught users that upgrades within patch series are reversible and that preserve_downgrade_option is only needed for major/minor releases. If the bug is severe enough we could override that policy and introduce an irreversible upgrade in a patch release (with a cluster setting).
I'm not sure this bug rises to that level, though. If we can't find a solution that avoids the need for a new cluster version, I think it would be better to disable time-bound iterators in 2.1.x and bring them back in 2.2 with the long-term fix.
What does rocksdb do with keys that are not bit-identical but are equivalent according to the comparator? If an intent is repeatedly added and removed you could have SST files with two metadata tombstones at different timestamps getting merged into one lower-level SST. Would it recognize that only one (and only the newer one) should appear in the new SST?
I'm pretty sure that RocksDB always uses the comparator for comparison and never checks to see if two keys are bit-identical. All keys in RocksDB have a sequence number. If an intent is repeatedly added and removed, only the latest entry (the one with the largest sequence number) will appear in the new SST.
Are you saying that bumping COCKROACHDB_VERSION would be acceptable in 2.1.x but adding a cluster version would not?
No. My point was that any change that involves tweaking the on-disk format in a backward incompatible way will require bumping COCKROACHDB_VERSION. I didn't consider whether that would be acceptable for 2.1.x, but it is almost certainly not, just as bumping the cluster version in a patch release is not acceptable.
I'm pretty sure that RocksDB always uses the comparator for comparison and never checks to see if two keys are bit-identical. All keys in RocksDB have a sequence number. If an intent is repeatedly added and removed, only the latest entry (the one with the largest sequence number) will appear in the new SST.
Is that sufficient? Wouldn't it be possible for an intent at keyA/ts2 to be written and aborted, and then an intent at keyA/ts1 to be written and committed? If ts2 > ts1 (which isn't prevented if nothing has read keyA at >= ts1), then we'd have a problem with information loss if the two tombstones end up in the same sst. The deletion tombstone for the intent at ts1 would have a larger seq num than the deletion tombstone for the intent at ts2 and would wipe it out. We could then see the intent at keyA/ts2 come back to life. This is a case we'll need to be careful with for a few of the proposed solutions.
How would keyA/ts2 come back to life? ISTM if we track time bounds correctly a scan to resolve ts1 will either see both keyA/ts2 and its deletion, or neither, both of which are valid.
The intent at keyA/ts2 could come back to life if a time-bound iterator is used with a lower-bound between ts1 and ts2 and the deletion tombstone for that intent was collapsed into the deletion tombstone for the intent at keyA/ts1 (which could have a larger sequence number!).
So the interleaving would be like this, right?
Delete(keyA) @ seqno4
Put(keyA) -> {intent, ts1) @ seqno3
Delete(keyA) @ seqno2
Put(keyA) -> {intent, ts2} @ seqno1
And you're saying that after a compaction we might end up with this?
Delete(keyA) @ seqno4
Put(keyA) -> {intent, ts2} @ seqno1
Hmm. That does seem problematic. It looks like that would be a problem with the delete-with-payload approach, but not with the new timestamp format approach that Peter suggested. It would also be straightforward to resolve this with the merge operator approach: you'd simply record the minimum and maximum timestamps that you'd merged and use those to inform the SST's bounds.
I'm not sure an interleaving like that (where a series of operations has increasing sequence numbers with decreasing timestamps) is really able to hit this bug (except for the edge case around sst boundaries). Intent resolution takes one of two forms: a committed intent just deletes the metadata; this is the problematic case since there is no timestamp information at the current sequence number. An aborted intent deletes the metadata and the intent value, and the intent value tombstone carries a timestamp.
So if the intent at ts2 was committed, it would be impossible to follow it with an intent at ts1. If it was aborted, there'd be a Delete(keyA, ts2) at seqno 2.5 that would cover the problematic intent and give the sst the correct bounds.
So if the intent at ts2 was committed, it would be impossible to follow it with an intent at ts1. If it was aborted, there'd be a Delete(keyA, ts2) at seqno 2.5 that would cover the problematic intent and give the sst the correct bounds.
Right, unless we get really unlucky and the two deletions, Delete(keyA) and Delete(keyA @ ts2), end up in separate SSTs. However unlikely that split might be, we definitely don't want to pursue a solution for v2.2 that doesn't solve this edge case.
It seems like we could perhaps use the RocksDB merge operator here to tag a deletion of a metadata key with a timestamp. That is, rather than calling RocksDB->Delete(metadata_key), we'd call RocksDB->Merge(metadata_key, {TYPE_DELETE, ts, real_metadata}). Then the table properties collector would be able to discover the relevant timestamp for the deletion tombstone, and the merge operator would do the work of suppressing the key. That's going to be a backwards compatibility nightmare, though.
Are you proposing extending merge operators so they can perform deletions? I don't think merge operators support something like that right now.
Is it a problem that it doesn't? We would need to think of this merge operator as a delete key, not a single delete key. This means that merging the deletion key with the written key would need to result in a deletion key again, instead of resulting in nothing at all - except at the bottom of the lsm. Once the deletion key reaches L6, we'd want it to be removed. Does the merge API provide enough hooks for us to enforce this?
Is it a problem that it doesn't? We would need to think of this merge operator as a delete key, not a single delete key. This means that merging the deletion key with the written key would need to result in a deletion key again, instead of resulting in nothing at all - except at the bottom of the lsm. Once the deletion key reaches L6, we'd want it to be removed. Does the merge API provide enough hooks for us to enforce this?
Kind of. You can tell when it's safe to drop the value because the existing_value provided to the merge operator will be null. The output of a merge operator, however, cannot be null; i.e., you have to produce _some_ value. We have two options here: we can output an empty value and teach the mvccScanner to treat this as "no intent", or we can tweak RocksDB so that merge operators can result in the absence of a value.
We have two options here: we can output an empty value and teach the mvccScanner to treat this as "no intent", or we can tweak RocksDB so that merge operators can result in the absence of a value.
I think we want to do the latter, because we want the intent keys to disappear when they reach the bottom level of the LSM. It is possible this could already be accomplished in RocksDB with a compaction filter, though I'm not sure if that does the right thing with respect to snapshots.
Regarding compaction filters and snapshots:
If there is a snapshot taken later than the key/value pair, the key/value pair cannot be filtered and compaction filter will not be invoked on it.
Ok, here's a quick summary of where we stand. The only two approaches that are capable of solving the hazard that Nathan mentioned are:
1. The table filter approach: when the table filter selects an SST, also select overlapping SSTs at other levels, so that a resolved intent is never read without the deletion tombstone that resolves it.
2. The merge operator approach: replace the plain deletion of the MVCCMetadata key with a merge whose value carries the intent's timestamp, so the table properties collector can account for it in the SST's time bounds.
(Someone please correct me if I'm wrong.) I prefer the merge operator approach (option 2) to the table filter approach (option 1). The table filter approach is conceptually messy—table filters aren't supposed to take in information besides what's in the SST—and I don't see a clear path to implementation, since level iterators in RocksDB are currently completely independent.
The merge operator approach, by contrast, is conceptually sound and straightforward to implement. Timestamp information stays within the merged key, so there's no need to look across levels when constructing iterators/filtering tables/compacting. And there's something elegant about allowing merges to result in the absence of a value. As far as I can tell the required RocksDB changes will be small.
There's a variant of the merge operator approach where we don't need to make any changes to RocksDB. In this variant, our merge operator outputs the empty value ("") to indicate a pseudo-deletion, which the MVCC scanner would interpret as a proper deletion. Then we'd use a compaction filter to suppress these keys from the bottom level so that they wouldn't take up space forever. But this requires tightly coupling the compaction filter, merge operator, and MVCC scanner, which I'd like to avoid.
@benesch These are the viable options that I'm aware of. One advantage of the table filter approach is that it isn't a format change. Old versions of cockroach would be compatible with data produced by new versions of cockroach. The merge operator approach will require bumping COCKROACH_VERSION and we may want to be careful about doing that to allow users to downgrade before the cluster version is bumped.
The table filter approach is conceptually messy—table filters aren't supposed to take in information besides what's in the SST—and I don't see a clear path to implementation, since level iterators in RocksDB are currently completely independent.
I said this in jest in person: we could stuff the cross-level information into RangeDelAggregator which does propagate data between levels. That is horribly messy and that comment made @benesch look mildly nauseous.
@benesch These are the viable options that I'm aware of. One advantage of the table filter approach is that it isn't a format change. Old versions of cockroach would be compatible with data produced by new versions of cockroach. The merge operator approach will require bumping COCKROACH_VERSION and we may want to be careful about doing that to allow users to downgrade before the cluster version is bumped.
True. But we can tie the bump of COCKROACH_VERSION to the bump of the cluster version, and since we don't allow rolling back a cluster version bump, I don't think we're any worse off.
I said this in jest in person: we could stuff the cross-level information into RangeDelAggregator which does propagate data between levels. That is horribly messy and that comment made @benesch look mildly nauseous.
Now that I think about it: are time-bound iterators and range deletions completely broken? Seems like a time bound iterator might completely miss a range deletion that ends up in a different SST.
Range deletions break MVCC generally (which is why we only use them when tables are completely gone), and we only use time-bound iterators for MVCC data.
But we can tie the bump of COCKROACH_VERSION to the bump of the cluster version, and since we don't allow rolling back a cluster version bump, I don't think we're any worse off.
Yes, that is what we'll have to do. It just requires some additional care in when this new merge operator functionality is used. We'll have to keep the code path which uses a deletion tombstone to delete an intent. Hopefully you'll be able to do this somewhat cleanly.
Range deletions break MVCC generally (which is why we only use them when tables are completely gone), and we only use time-bound iterators for MVCC data.
The problem isn't clear range, but snapshots. @bdarnell and I just chatted offline, and we suspect that a replica getting rebalanced away and back could result in a range tombstone ending up in an SST without appropriate time bounds and invisible to future time-bound scans.
While writing a test for this, I've discovered that RefreshRange (without my hack) would be incorrect even if we handled timestamp bounds correctly at the rocksdb level. Consider the following sstables:
A transaction at ts3 calls RefreshRange. The first sst matches the table filter; the second doesn't (with or without the fix). If the value at k1 were committed, this would be fine: RefreshRange discards committed values at older timestamps. But if we don't look at sst2, we see the intent as still pending. We can't discard it as "older" because it may have been pushed and committed at a higher timestamp.
Since this issue would be present even with perfect timestamp bounds, the hack is looking less hacky - any use of a time-bound iterator that returns an intent must follow up without a time bound to see if/when the intent was resolved. But more than that, I think this calls time-bound iterators into question generally. Since intents can be pushed, can we ever count on a TBI to give us complete and accurate information?
Offline discussion revealed that there is another important criterion for the safe use of time-bound iterators: Any use of a time-bound iterator with min-timestamp T must have been preceded by a read of the same span at time T which verified that there are no unresolved intents at that timestamp. (and updated the timestamp cache to ensure that no new intents will be written below that point) I think that's true of all current TBI usage, but it's not true in my test, so the scenario above can't occur.
The problem isn't clear range, but snapshots. @bdarnell and I just chatted offline, and we suspect that a replica getting rebalanced away and back could result in a range tombstone ending up in an SST without appropriate time bounds and invisible to future time-bound scans.
Sstables containing range deletions should be rare. Should we clear the timestamp bounds for an sstable whenever it contains a range deletion tombstone?
Sstables containing range deletions should be rare. Should we clear the timestamp bounds for an sstable whenever it contains a range deletion tombstone?
I think that might be our only option, unless we invent some way to stuff a payload into range deletions.
After correcting the misunderstanding in my previous two messages, I'm having a harder time coming up with a real reproduction of the bug. RefreshRange returns an error for both intents and committed values, so losing an MVCCMetadata tombstone for a committed intent shouldn't change anything. I think this means that the adriatic bug is in fact the sstable boundary condition: the intent is aborted, but the mvcc metadata tombstone and the value tombstone ended up in different sstables. Is this consistent with other folks' understanding, or does this seem too implausible and we should keep looking for a way the bug could occur mid-sstable?
Not sure this makes a difference in any of our conclusions, but FWIW I recall our incremental backup testing showing that TBIs _didn't_ make a big difference. For a reasonable incremental backup schedule (1 hour) enough compactions had happened that it didn't move the needle much. This was a very long time ago, though, so I don't remember the details and things might have changed.
However, pretty sure TBIs have a big impact on changefeed performance. They currently work by repeatedly polling small windows of time using ExportRequest, which is the best case scenario for TBI. When we start up a changefeed over a freshly restored tpcc, which isn't as good for TBIs, the ExportRequest latencies do initially spike. IIRC, the RangeFeed performance story depends on them as well.
Any use of a time-bound iterator with min-timestamp T must have been preceded by a read of the same span at time T which verified that there are no unresolved intents at that timestamp.
Huh, that's funny. Incremental backup does this accidentally (you can only do an incremental backup on top of an existing incremental backup).
A changefeed that's requested to start at a certain timestamp (using the cursor= option) does not though. It will happily start sending ExportRequests from the requested timestamp until now(), which under the hood will trigger TBIs with that bound.
After correcting the misunderstanding in my previous two messages, I'm having a harder time coming up with a real reproduction of the bug. RefreshRange returns an error for both intents and committed values, so losing an MVCCMetadata tombstone for a committed intent shouldn't change anything. I think this means that the adriatic bug is in fact the sstable boundary condition: the intent is aborted, but the mvcc metadata tombstone and the value tombstone ended up in different sstables. Is this consistent with other folks' understanding, or does this seem too implausible and we should keep looking for a way the bug could occur mid-sstable?
That's not my understanding. A committed value above this timestamp will cause RefreshRange to throw an error, but a committed value below it won't. The issue here is that an intent at any timestamp will cause the RefreshRange to fail. We see exactly this in the logs. The RefreshRange is refreshing at a timestamp larger than the zombie intent's timestamp.
This brings up an interesting question. Is another hack around this bug to simply ignore intents at timestamps below the refresh timestamp? There's no way one should have been added after the initial scan, which is what makes the use of a time-bound iterator safe in the first place. I think it's the case that any intent below our refresh timestamp is necessarily an artifact of this time-bound iterator bug.
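In Go-flavored pseudocode, that hack would be a single check inside the RefreshRange loop; the variable names are illustrative, with refreshFrom standing for the timestamp of the transaction's original read over this span:

// When the time-bound iterator surfaces what looks like an intent:
if intentTS.Less(refreshFrom) {
    // The original read at refreshFrom already verified that there were no unresolved
    // intents at or below that timestamp (and bumped the timestamp cache), so any
    // "intent" below it can only be a zombie produced by this TBI bug; skip it.
    continue
}
// Otherwise treat it as a genuine conflicting intent and fail the refresh as usual.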
A committed value above this timestamp will cause RefreshRange to throw an error, but a committed value below it won't.
But you couldn't get a committed value below this timestamp unless you previously had an intent below it, and we guarantee that there is no intent below the timestamp before we add the span to the refresher.
Oh, but I guess that's what you're saying. We guaranteed that there were no intents on the initial read, so anything that appears to be an intent in the past on the refresh must be an occurrence of this bug. So maybe my test is not unrealistic after all.
We guaranteed that there were no intents on the initial read, so anything that appears to be an intent in the past on the refresh must be an occurrence of this bug.
Yes, exactly. I can't think of a situation where it's possible that an intent below the refresh timestamp isn't a result of this bug (unless there are other bugs...).
FWIW, this is also going to wreak havoc on a rangefeed if the feed hits the issue during its catch-up scan. It will begin tracking the intent and never see its resolution, so its resolved timestamp will stall forever.
Looks like we're hitting this in cdc, too: https://github.com/cockroachdb/cockroach/issues/32104#issuecomment-438893432. ExportRequest gets stuck in a loop waiting for an intent to resolve, but the intent only seems like it's there because of TBIs.
@nvanbenschoten and I sharpened our understanding of this issue tonight. Suppose we do fix this deletion tombstone problem using one of the approaches described above. It turns out TBIs will _still_ be incorrect.
To see why, consider the following two SSTs:
SST 1 [ts1, ts4]:
    Put(/a) -> {intent @ ts1}
    Put(/a @ ts1)
    Put(/b @ ts4)

SST 2 [ts1, ts1]:
    MagicDel(/a) -> {intent @ ts1}
Notice how SST 2 has the appropriate bounds through some new magic deletion tombstone. Let's say we try to do an MVCC iteration using a time-bound iterator over [ts2, ts5]. This will be _completely_ wrong: we'll pick up the intent in SST 1 because we happen to overlap its time bounds, but we'll drop the resolution of that intent from SST 2 on the floor. This will result in a write intent error even though we specified we didn't care about data at ts1.
Perhaps this was already known to the other folks in this discussion, but it was definitely surprising to me and Nathan.
My proposed fix, based on an idea by Nathan, is to push enforcement of time bounds into the (C++) MVCC scan codepath. Before the MVCC scanner even considers a key, if it sees that the key's timestamp is outside its underlying iterator's time-bound hints, it throws away the key.
The API will look something like this:
type ScanOptions struct {
    MinTimestamp hlc.Timestamp
    MaxTimestamp hlc.Timestamp
    Txn          *roachpb.Transaction
    Consistent   bool
    Tombstones   bool
    // ...
}

func MVCCScan(start, end roachpb.Key, opts ScanOptions) (results, error)
func MVCCIterate(start, end roachpb.Key, opts ScanOptions, f func(/* ... */)) error

type IterOptions struct {
    LowerBound       roachpb.Key
    UpperBound       roachpb.Key
    MinTimestampHint hlc.Timestamp
    MaxTimestampHint hlc.Timestamp
}

func NewIterator(IterOptions) Iterator
The important point above is that IterOptions continues to take timestamp bounds as hints, while the higher-level scan functions have optional timestamp bounds that, when specified, are not hints but guarantees.
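For example, using the sketched signatures above (the call sites are illustrations of the intended split, not real code):

// Low-level callers that understand hint semantics (e.g. the incremental iterator)
// keep using the raw iterator:
it := NewIterator(IterOptions{
    MinTimestampHint: startTS, // SSTs wholly outside these bounds may be skipped...
    MaxTimestampHint: endTS,   // ...but keys outside the bounds can still be surfaced
})

// Everyone else goes through the MVCC layer, where the bounds are enforced per key:
res, err := MVCCScan(startKey, endKey, ScanOptions{
    MinTimestamp: refreshFrom, // guarantee: no key outside [MinTimestamp, MaxTimestamp]
    MaxTimestamp: refreshTo,   // is ever returned, regardless of which SSTs were read
})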
Since naked time-bound iterators are so hard to use correctly, we shouldn't see anyone setting MinTimestampHint/MaxTimestampHint besides the internal MVCC pathways, like MVCCScan/MVCCResolveIntent/the CCL incremental MVCC iterator. Even usages of the higher-level time-bound MVCC operations should be scrutinized, since they come with their own footgun, namely, that you need to have proof that no intents beneath your min timestamp bound could exist.
If this sounds good to folks I'm going to embark on this project tomorrow.
Thanks for the example, @benesch. Yowza, that's broken indeed. With your proposed fix, the operation (say RefreshRange) would still apply the time bounds (and exclude SST2), but would additionally throw away the intent at ts1, right? You're alluding to a refactor that lets RefreshRange not even provide the hints. What would it call instead? Or have we concluded that RefreshRange just doesn't need to use the time bounds?
The general plan SGTM.
while the higher-level scan functions have optional timestamp bounds that, when specified, are not hints but guarantees.
Ah, it would call MVCCScan(..., guaranteedMin, guaranteedMax), right?
@benesch Looks like ccl/storageccl/engineccl.MVCCIncrementalIterator already does the right thing. Or am I misreading that code?
Is there a reason that MVCCIncrementalIterator doesn't use MVCCScan? Seems like the current code is going to be fairly slow due to the many cgo crossings. @dt is this the part of backup that you keep wanting to push into C++?
@benesch Looks like ccl/storageccl/engineccl.MVCCIncrementalIterator already does the right thing. Or am I misreading that code?
This looks busted to me because it doesn't filter out intents with a start time greater than the lower time bound: https://github.com/cockroachdb/cockroach/blob/b5caad2b2678ac861a1e5104237e25a38805f1a3/pkg/ccl/storageccl/engineccl/mvcc.go#L154
Is there a reason that MVCCIncrementalIterator doesn't use MVCCScan? Seems like the current code is going to be fairly slow due to the many cgo crossings. @dt is this the part of backup that you keep wanting to push into C++?
Yes, it is.
Yep, as @benesch said, the body of evalExport and the MVCCIncrementalIterator is what we've wanted to try moving to C++. I played with this a little last year but didn't get very far: https://github.com/cockroachdb/cockroach/pull/20328/files#diff-77e88bcbe1e6adc1a10301448cc1ec7bR98
Hmm, it appears I'll incidentally be rewriting MVCCIncrementalIterator in C++ shortly.
@benesch Yeah, this is what I was trying to get at in https://github.com/cockroachdb/cockroach/issues/28358#issuecomment-438390174. I think discarding keys below the timestamp is the way to go, but it's kind of scary.
@benesch Yeah, this is what I was trying to get at in #28358 (comment). I think discarding keys below the timestamp is the way to go, but it's kind of scary.
Ah, yeah, I didn't internalize that when you wrote it up.
For the RefreshRange and ResolveIntent usage of timebound iterator, I think we want semantics where if we select an SST for the filter, we also automatically select overlapping SSTs that are at higher levels of the LSM. I think this is the moral equivalent of what the [txn-start, max) proposal is trying to do.
I was thinking about ways of making this approach feasible, but I think I just found a fundamental flaw instead. There's no guarantee that newer versions are in higher levels, right? I'm pretty sure it's just as possible for a newer version of a key to be in a lower level. Since L0 is unsorted you might decide to compact the L0 file with the newer version of the key before the L0 file with the older version of the key. So you'd need to look for overlapping SSTs both upwards and downwards, and that's likely to erase most of the performance benefits.
I'm now of the opinion that using the merge operator is the only viable option we've come up with.
The one other thing that comes to mind is sticking a secret timestamp on intent keys. Something like /key/<secret-intent-timestamp>/<zero-mvcc-timestamp>. Only the table properties collector would look at this timestamp; everything else would ignore it. This seems like a terrifyingly invasive change, but it could probably be made to work.
For the RefreshRange and ResolveIntent usage of timebound iterator, I think we want semantics where if we select an SST for the filter, we also automatically select overlapping SSTs that are at higher levels of the LSM. I think this is the moral equivalent of what the [txn-start, max) proposal is trying to do.
I was thinking about ways of making this approach feasible, but I think I just found a fundamental flaw instead. There's no guarantee that newer versions are in higher levels, right? I'm pretty sure it's just as possible for a newer version of a key to be in a lower level. Since L0 is unsorted you might decide to compact the L0 file with the newer version of the key before the L0 file with the older version of the key. So you'd need to look for overlapping SSTs both upwards and downwards, and that's likely to erase most of the performance benefits.
Eh? I'm not following the scenario you're describing. L0 is unsorted in key order, but it is sorted in sequence number order. While RocksDB doesn't have to compact all of L0 at once, if it does compact an L0 table to L1, it has to compact the oldest L0 table for a user key first. I recall reading somewhere that RocksDB maintains the invariant that for a given user key, newer sequence numbers will either appear in the same level or a higher level. You're talking about timestamps and not sequence numbers, but I'm failing to see how you could get a newer timestamp in a lower level.
I recall reading somewhere that RocksDB maintains the invariant that for a given user key, newer sequence numbers will either appear in the same level or a higher level. You're talking about timestamps and not sequence numbers, but I'm failing to see how you could get a newer timestamp in a lower level.
Do you have a source for that? I've wanted to believe this for a long time and yet have never been able to find proof that that is the case. In fact I feel like we've talked about this before and come to the conclusion that there was no evidence that RocksDB provides this guarantee.
After looking at the code though it seems like all overlapping L0 files are always selected for compaction, which would provide this guarantee. Hmm.
Do you have a source for that? I've wanted to believe this for a long time and yet have never been able to find proof that that is the case. In fact I feel like we've talked about this before and come to the conclusion that there was no evidence that RocksDB provides this guarantee.
Take a look at Version::Get and FilePicker. While iterators use a heap to merge together the levels, DB::Get walks down the levels. That code isn't exactly clear, but it walks level by level, and within L0 it walks from newest to oldest table. The comment on FilePicker seems to be talking about this invariant, though it isn't clearly stated:
// Class to help choose the next file to search for the particular key.
// Searches and returns files level by level.
// We can search level-by-level since entries never hop across
// levels. Therefore we are guaranteed that if we find data
// in a smaller level, later levels are irrelevant (unless we
// are MergeInProgress).
Interestingly, that comment is copied from LevelDB's Version::Get. The code in LevelDB is easier to understand.
FYI, I ran some tests last week to see the performance impact of disabling TBIs on changefeeds. Not sure it's workable. https://github.com/cockroachdb/cockroach/issues/32799
Testing the normal iterator workaround seems annoying, but the patch is pretty straightforward. Whenever you see an intent in MVCCIncrementalIterator, look up the intent key with a normal iterator. If it doesn't exist, skip the intent, otherwise use the value from the normal iterator for the rest of the logic in (*MVCCIncrementalIterator).advance. https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/storageccl/engineccl/mvcc.go#L133
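A rough Go sketch of that check (the surrounding advance loop, the i.reader handle to the underlying engine.Reader, and the skip helper are illustrative; MakeMVCCMetadataKey and the prefix iterator option are existing engine APIs):

// Inside (*MVCCIncrementalIterator).advance, upon decoding an intent at key:
verify := i.reader.NewIterator(engine.IterOptions{Prefix: true}) // regular, non-time-bound iterator
defer verify.Close()
verify.Seek(engine.MakeMVCCMetadataKey(key))
ok, err := verify.Valid()
if err != nil {
    i.err = err
    return
}
if !ok || !verify.UnsafeKey().Key.Equal(key) || verify.UnsafeKey().IsValue() {
    // The metadata key no longer exists: the intent was already resolved, and the TBI
    // only thinks it's there because the resolution lives in an SST it skipped.
    i.skipCurrentKey() // hypothetical: step past this key and continue the scan
    return
}
// Otherwise it's a real intent: continue with the existing logic (e.g. returning a
// WriteIntentError), but use the metadata value read from the regular iterator.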
Not sure it's workable
That's quite the understatement.
I think it's fair to say this was re-closed by the workaround merged in #32909