Scylla: Alternator Streams thinks that PutItem is a REMOVE+MODIFY

Created on 27 Jul 2020 · 20Comments · Source: scylladb/scylla

This issue is similar to issue #6918 (perhaps it should be included into that issue? but I'm not sure), but even more surprising:
It turns out that every PutItem operation appears today in the Alternator Stream as two events - a REMOVE and a MODIFY! This is instead of just one event - a MODIFY or an INSERT (depending on whether the item existed before - see #6918).

This "REMOVE" event appears because our implementation of PutItem inserts a row tombstone with timestamp ts-1 together with the new value with timestamp ts. We need this tombstone because the row is supposed to completely replace the old row - not be merged into it.

I'm not sure how we can or should solve this. CDC already gives special attention to tombstones preceeding a value's timestamp by exactly one, maybe this situation can be recognized already by CDC and the "REMOVE" event avoided from CDC in the lower level. But if we can't do that, at least the Alternator layer should notice this situation - a REMOVE followed by a MODIFY with a difference of one in timestamp, and hide the REMOVE event from its output.

CC @elcallio

Alternator AlternatorStreams

Source

nyh

All 20 comments

On paper it looks like it would be solvable in get_records, but there are some caveats, all due to selection windows and result limit. Since the tombstone marking the "it's a new row" is ts-1, it would be perfectly possible to get a query window that includes the tombstone, but not the actual insert. So, to handle this we would pretty much be forced to never always skip a trailing REMOVE and adjust the next query window to see if it fit with an insert on remove ts+1.
But this would instead somewhat break cases where we have a lot of actual deletes in a row, in which case we would be constantly staggering. Perhaps manageable, but not nice.

The same more or less applied, but much more easily handled, for the query limit.
It strikes me that the pattern "tombstone at ts-1" is common enough that we should maybe have it formalized and recognized in cdc, and generate something identifiable for it. It is after all prevalent for all non-atomics etc as well. Maybe coalesce such a pattern into a cdc_op::replace? @kbr, @haaawk - thoughts?

Otoh also, I don't think alternator giving a slightly different event stream in these cases are a super huge deal. It does technically express the same thing. Would be nice to know if it really would hamper a real user...

elcallio on 19 Aug 2020

On paper it looks like it would be solvable in get_records, but there are some caveats, all due to selection windows and result limit. Since the tombstone marking the "it's a new row" is ts-1, it would be perfectly possible to get a query window that includes the tombstone, but not the actual insert. So, to handle this we would pretty much be forced to never always skip a trailing REMOVE and adjust the next query window to see if it fit with an insert on remove ts+1.
But this would instead somewhat break cases where we have a lot of actual deletes in a row, in which case we would be constantly staggering. Perhaps manageable, but not nice.

The same more or less applied, but much more easily handled, for the query limit.
It strikes me that the pattern "tombstone at ts-1" is common enough that we should maybe have it formalized and recognized in cdc, and generate something identifiable for it. It is after all prevalent for all non-atomics etc as well. Maybe coalesce such a pattern into a cdc_op::replace? @kbr, @haaawk - thoughts?

Otoh also, I don't think alternator giving a slightly different event stream in these cases are a super huge deal. It does technically express the same thing. Would be nice to know if it really would hamper a real user...

At some point we were thinking about marking this tombstone with ts instead of ts-1. This works fine with CQL because update SET x = null using timestamp T actually uses T-1. I don't remember whether we did this in the end or not. @kbr- do you remember? I think we did.

haaawk on 19 Aug 2020

This should happen only for collections, because that's where the T-1 tombstone appears. The OP says this happens for every PutItem. Does PutItem refer to collections only?
Actually this shouldn't happen for non-TTLed operations even with collections. Today if we replace a collection without a TTL, we show it as a single delta entry in the CDC log, we don't split anymore.

The issue is reported 23 days ago, @nyh can you confirm this is still happening?

kbr- on 19 Aug 2020

The test still xfails (this is why I love test-first development).

test/alternator/run --runxfail test_streams.py::test_streams_putitem_keys_only

nyh on 19 Aug 2020

@kbr Alternator put_item explicitly puts a row tombstone at ts-1 when doing insert. See executor.cc:1277

elcallio on 19 Aug 2020

@kbr Alternator put_item explicitly puts a row tombstone at ts-1 when doing insert. See executor.cc:1277

Right, I'll explain. When Alternator wishes to replace an item completely (not merge some new attributes with old attributes) it needs to delete the old item and then write the new item. If a write and delete are done with the same timestamp the delete wins, which is why we must do the delete at ts-1. This is not a new Alternator invention - Scylla already does exactly the same thing when replacing a collection, it puts a range tombstone for the collection at timestamp ts-1 and the new collection cells with timestamp ts=1.

nyh on 19 Aug 2020

Otoh also, I don't think alternator giving a slightly different event stream in these cases are a super huge deal. It does technically express the same thing. Would be nice to know if it really would hamper a real user...

This is a good question. If the item did exist before, replacing it is indeed equivalent to a REMOVE followed by an INSERT (See #6918 on the separate issues of an item which didn't exist before, and on INSERT vs MODIFY). I think we need to learn more about how real users use DynamoDB streams, and how much they might care about this issue and about #6918.

So maybe these issues should be considered very low priority. On the other hand, it would have been nicer to have perfect DynamoDB compatibility, without making excuses on why we chose not to be compatible.

nyh on 19 Aug 2020

Writing a cell completely replaces the older versions of this cell.
I don't understand why you need a row tombstone?

Unless:

you have multiple columns in the schema
you want to write a completely new row that is empty for all columns except one column with some new value.

So if I understand correctly

those alternator "items" are rows composed of multiple cells which represent attributes,
changing a single attribute is adding a single cell
replacing an item is putting a row tombstone and adding a cell

Okay, if you're asking me whether we should give special treatment to mutations which have a row tombstone with timestamp T-1 and a cell with timestamp T, my answer is: I don't know, but I'd probably refrain from doing that.

First, it would also affect updates coming from CQL. So if someone in CQL creates a batch as follows:

begin batch
delete from ks.t using timestamp 42 where pk = 0 and ck = 0;
update ks.t using timestamp 43 set v = 0 where pk = 0 and ck = 0;
apply batch;

they would get a single entry in CDC (instead of two) with a completely new type of operation due to our special handling.

Second, CDC code is complex already. Believe me, you don't want to have any more edge cases there.

Generally speaking, CDC was written with CQL in mind. That's why it has special handling for the collection tombstone case. If now we mix-in Alternator-frontend-specific logic to the code, we're going to make a mess. And I'm afraid the problem discussed in this thread is not the first problem we'll run into.

I think we should have a translation layer between CDC log table and Alternator streams which detects such Alternator-frontend-specific stuff (maybe before even considering query windows etc. so we don't run into problems such as the one Calle described) and makes the necessary adjustments.

kbr- on 19 Aug 2020

Instead of just setting ts-1 for tombstone, we could for internal purposes mark it in such a way that it is recognized as not being user generated. For all the cases in scylla, i.e. collections etc. It should effectively make cdc easier.
Not sure how such a mark would look, if we have any status bits or anything to play with, but it would extend the expressiveness of our internal manipulations so we can detect what we mean, not what we do.

elcallio on 19 Aug 2020

👍1

Writing a cell completely replaces the older versions of this cell.
I don't understand why you need a row tombstone?

Because we want to replace the entire row - not a single cell.
Imagine that no matter what this row previously contained you want to set "a=3". Even if it had previously "b=4", you need to remove it as well. Not merge and have both "a=3" and "b=4".

As I said, Scylla also has exactly the same need in a CQL operation which wants to set - i.e., replace - a collection (e.g., a set). It also needs to delete the previous contents of the collection (this is a range tombstone) and write new contents. In that existing code originating in Cassandra, Scylla uses timestamp ts-1 for the range tombstone, and in Alternator we did exactly the same when replacing an item.

Unless:

* you have multiple columns in the schema
* you want to write a completely new row that is empty for all columns except one column with some new value.

Exactly, but 1. It doesn't have to be "except one column", and 2. Some of the columns are collections (in Alternator, we have one main map column) and the entire collection must be clearer - we don't know which specific items we need to delete, and we need a tombstone.

First, it would also affect updates coming from CQL.

I suggest you please add to your tests a CQL test which replaces a collection with a new value (there's a CQL syntax for this!). In the CDC log, you'll see a deletion at timestamp TS-1 and the new cells at timestamp TS. Think if that is what you want to see - or not.

So if someone in CQL creates a batch as follows:
begin batch
delete from ks.t using timestamp 42 where pk = 0 and ck = 0;
update ks.t using timestamp 43 set v = 0 where pk = 0 and ck = 0;
apply batch;
they would get a single entry in CDC (instead of two) with a completely new type of operation due to our special handling.

Right. Note that in Alternator we assume the user cannot choose the timestamp on their own. So a user cannot manufacture this case.

Second, CDC code is complex already. Believe me, you don't want to have any more edge cases there.

Generally speaking, CDC was written with CQL in mind. That's why it has special handling for the collection tombstone case. If now we mix-in Alternator-frontend-specific logic to the code, we're going to make a mess. And I'm afraid the problem discussed in this thread is not the first problem we'll run into.

I think we should have a translation layer between CDC log table and Alternator streams which detects such Alternator-frontend-specific stuff (maybe before even considering query windows etc. so we don't run into problems such as the one Calle described) and makes the necessary adjustments.

Yes, sounds like a reasonable approach, but might not be easy because of the time window issues that Calle described.

nyh on 19 Aug 2020

On Wed, Aug 19, 2020 at 3:01 PM nyh notifications@github.com wrote:

Writing a cell completely replaces the older versions of this cell.
I don't understand why you need a row tombstone?

Because we want to replace the entire row - not a single cell.
Imagine that no matter what this row previously contained you want to set
"a=3". Even if it had previously "b=4", you need to remove it as well. Not
merge and have both "a=3" and "b=4".

As I said, Scylla also has exactly the same need in a CQL operation which
wants to set - i.e., replace - a collection (e.g., a set). It also needs to
delete the previous contents of the collection (this is a range tombstone)
and write new contents. In that existing code originating in Cassandra,
Scylla uses timestamp ts-1 for the range tombstone, and in Alternator we
did exactly the same when replacing an item.

Unless:

you have multiple columns in the schema

you want to write a completely new row that is empty for all columns except one column with some new value.

Exactly, but 1. It doesn't have to be "except one column", and 2. Some of
the columns are collections (in Alternator, we have one main map
column) and the entire collection must be clearer - we don't know which
specific items we need to delete, and we need a tombstone.

First, it would also affect updates coming from CQL.

I suggest you please add to your tests a CQL test which replaces a
collection with a new value (there's a CQL syntax for this!). In the CDC
log, you'll see a deletion at timestamp TS-1 and the new cells at timestamp
TS. Think if that is what you want to see - or not.

So if someone in CQL creates a batch as follows:

begin batchdelete from ks.t using timestamp 42 where pk = 0 and ck = 0;update ks.t using timestamp 43 set v = 0 where pk = 0 and ck = 0;
apply batch;

they would get a single entry in CDC (instead of two) with a completely
new type of operation due to our special handling.

Right. Note that in Alternator we assume the user cannot choose the
timestamp on their own. So a user cannot manufacture this case.

Second, CDC code is complex already. Believe me, you don't want to have
any more edge cases there.

Generally speaking, CDC was written with CQL in mind. That's why it has
special handling for the collection tombstone case. If now we mix-in
Alternator-frontend-specific logic to the code, we're going to make a mess.
And I'm afraid the problem discussed in this thread is not the first
problem we'll run into.

I think we should have a translation layer between CDC log table and
Alternator streams which detects such Alternator-frontend-specific stuff
(maybe before even considering query windows etc. so we don't run into
problems such as the one Calle described) and makes the necessary
adjustments.

Yes, sounds like a reasonable approach, but might not be easy because of
the time window issues that Calle described.

+1 to Kamil's comment. This is Alternator stream specific and better not to
inject it into core CDC code.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/6930#issuecomment-676317968,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAKADLO3IUZAHWF4HHSRUBDSBPEMHANCNFSM4PJKHF4A
.

haaawk on 19 Aug 2020

@haaawk we already do ts-1 _detection_ in cdc code, to deal with collections tombstones and avoid generating to many cdc rows. I don't think it unreasonable that even this code would benefit from the ts-1 pattern being marked as "generated by internal code" and thus treated explicitly. In fact, it would be safer.
Again, I am not advocating we add more assumptions to cdc. I am advocating that the ts-1 tombstone patterns needs to be better identifiable as "created by scylla code to emulate certain ops" (like replacing a collection, or row). It's right now that we are making assumptions on just timestamps, which is generally bad. Because even in alternator, where user timestamps does not exist, it would still be perfectly possible to generate a delete + put with exactly one timestamp tick between them, just not very likely.

elcallio on 19 Aug 2020

👍1

@nyh

Some of the columns are collections (in Alternator, we have one main map column) and the entire collection must be clearer - we don't know which specific items we need to delete, and we need a tombstone.

You can use a collection tombstone for that, you don't need a row tombstone.

I suggest you please add to your tests a CQL test which replaces a collection with a new value (there's a CQL syntax for this!). In the CDC log, you'll see a deletion at timestamp TS-1 and the new cells at timestamp TS. Think if that is what you want to see - or not.

No, I'll see one entry with one timestamp. We changed it some time ago (https://github.com/scylladb/scylla/pull/6491).

cqlsh> create keyspace ks with replication = {'class':'SimpleStrategy', 'replication_factor': 1};
cqlsh> create table ks.t (pk int, ck int, v map<int, int>, primary key (pk, ck)) with cdc = {'enabled': true};
cqlsh> update ks.t set v = {0:0} where pk = 0 and ck = 0;
cqlsh> select "cdc$time", v, "cdc$deleted_v", "cdc$deleted_elements_v" from ks.t_scylla_cdc_log;

 cdc$time                             | v      | cdc$deleted_v | cdc$deleted_elements_v
--------------------------------------+--------+---------------+------------------------
 5f1930ec-e223-11ea-1385-76387c3931ec | {0: 0} |          True |                   null

That's the special collection handling that I mentioned in my previous post.

kbr- on 19 Aug 2020

👍1

@nyh

Some of the columns are collections (in Alternator, we have one main map column) and the entire collection must be clearer - we don't know which specific items we need to delete, and we need a tombstone.

You can use a collection tombstone for that, you don't need a row tombstone.

This is a very good observation, which I already made for other reasons in https://github.com/scylladb/scylla/issues/6918#issuecomment-664237567. Whereas we currently use a row tombstone to delete the item, we could use individual cell tombstones (for regular columns, if not overwritten by a new value) plus a collection tombstone (for the ":attrs" column) to delete the item. If CDC already has special handling for collection tombstones at ts-1 - but doesn't have the same handling for row tombstones - then it will indeed solve this problem.

I suggest you please add to your tests a CQL test which replaces a collection with a new value (there's a CQL syntax for this!). In the CDC log, you'll see a deletion at timestamp TS-1 and the new cells at timestamp TS. Think if that is what you want to see - or not.

No, I'll see one entry with one timestamp. We changed it some time ago (#6491).

Oh good, I didn't remember this. So, any particular reason why you changed this specifically just for collection tombstones and not for the similar issue of entire-row tombstones? I guess the main reason is that CQL only does the former, and doesn't have syntax for the latter (for overwriting an entire item)?

nyh on 19 Aug 2020

@elcallio

we already do ts-1 detection in cdc code, to deal with collections tombstones and avoid generating to many cdc rows

This is done by simply adding 1 to every collection tombstone's timestamp.
The "moral justification" why we can/should do that is the following: if X is a non-frozen column, then the statement
update ... using timestamp T set X = V ...
creates a T-1 timestamp collection tombstone. So we show it in CDC log under T - 1 + 1 = T timestamp.

So we made a decision based on CQL specifics, because CDC was designed with CQL in mind.

EDIT: @nyh this should answer your last question

I don't think we should do such a thing for row tombstones, because
delete from ... using timestamp T where ...
creates a timestamp T row tombstone; showing it under timestamp T+1 in the CDC log would be very confusing.

kbr- on 19 Aug 2020

@kbr - I agree. We should in fact not do any of these things. The tombstones should be marked as "artificial", "created internally" to express something. Then we could safely instead generate more expressive cdc info for whatever is happening.

elcallio on 19 Aug 2020

Today, CDC takes a base table mutation and returns a log table mutation. It's very simple.

What you're proposing would require a significant redesign: we would have to take additional data as input.
It would require changes in other places in the code, not just CDC. For example, we would have to plumb down the information about an "internally generated" collection tombstone all the way from CQL layer down to storage_proxy together with the mutation.

kbr- on 19 Aug 2020

Yes. And I think it might be worth considering doing so. Having more explicit track-keeping of what we are doing is not a bad thing.

elcallio on 19 Aug 2020

@nyh

Some of the columns are collections (in Alternator, we have one main map column) and the entire collection must be clearer - we don't know which specific items we need to delete, and we need a tombstone.

You can use a collection tombstone for that, you don't need a row tombstone.

This is a very good observation, which I already made for other reasons in #6918 (comment). Whereas we currently use a row tombstone to delete the item, we could use individual cell tombstones (for regular columns, if not overwritten by a new value) plus a collection tombstone (for the ":attrs" column) to delete the item. If CDC already has special handling for collection tombstones at ts-1 - but doesn't have the same handling for row tombstones - then it will indeed solve this problem.

It seems to be the best solution here to just use cell/collection tombstones instead of row tombstones. That would solve the issue and not add any complexity to the CDC itself.

haaawk on 24 Aug 2020

Its clear this is a difference - and we except it till we find there is an issue with the approach we have taken - pushing this out till we get user feedback