Scylla: LWT sporadic failure if a non-deterministic function is used to assign partition key

Created on 7 May 2021 · 33Comments · Source: scylladb/scylla

cqlsh> create table ks.test(cartId UUID, marker BOOLEAN STATIC, itemId UUID, count INT, PRIMARY KEY(cartId, itemId));
cqlsh> insert into ks.test(cartId, marker) values(uuid(), true) if not exists;
ServerError: std::get: wrong index for variant

Insert without if not exists works perfectly fine.

bug lwt

Source

haaawk

All 33 comments

FYI @kostja

haaawk on 7 May 2021

I can reproduce it in dev mode, but in debug mode the query succeeds. :thinking:
(edit: I can also reproduce it in debug mode, as long as I run it from the command line; just not in gdb :cry: )

dekimir on 9 May 2021

It only happens when the value inserted is uuid():

cqlsh:ks> insert into ks.test(cartId, marker) values(uuid(), true) if not exists;
ServerError: std::get: wrong index for variant
cqlsh:ks> insert into ks.test(cartId, marker) values(1990460f-64e0-4cc2-b834-e3cfa98741f0, true) if not exists;

 [applied] | cartid | itemid | marker | count
-----------+--------+--------+--------+-------
      True |   null |   null |   null |  null

dekimir on 9 May 2021

I believe this is the std::get that throws, based on asserts I've sprinkled around:
https://github.com/scylladb/scylla/blob/cee4c075d24129788dacb75f9868f1e2a16b02ae/transport/server.cc#L836

dekimir on 12 May 2021

@gleb-cloudius looks like you added the "has to be foreign ptr" comment; would you mind taking a look at this?

dekimir on 12 May 2021

First of all I can confirm that this is the get that throws. But then things looks bad. The logic here is that we have a function that can either return a result or a shard number the query should be processed on (based on the taken). If a shard number is returned the code jumps to that shard and expect result to be returned this time, but instead we again getting redirected to a different shard.
Weird.

gleb-cloudius on 12 May 2021

The problem is that the cql statement is not deterministic and will produce different pk each time it is evaluated. It means after re-executing the statement on the destination shard we may get different pk and request to bonce the request to another shard again.
I honestly do not see any easy solution to that, if at all (short of trying to execute the statement in a loop on the same shard hoping that it will eventually succeed).

gleb-cloudius on 12 May 2021

👍1

If we execute such statements on this shard in a loop we will have uneven data distribution across shards. We may want to execute the statement on destination shard in a loop. But that may have unpredictable performance implication, including performance dropping to 1/(n_shards) of the normal one, since it will take n_shards attempts on average to re-generate a new random value that falls on the same shard.

kostja on 12 May 2021

We could assign each non-deterministic function used in a partition key a dedicated slot in client_info, addressed by an integer. Then, any such function would have to fill that slot when invoked, and use that slot if it's not empty.

kostja on 12 May 2021

That would require mark-up of the expression tree during semantic analysis pass, which we're lacking (both the pass and the markup).

kostja on 12 May 2021

If we execute such statements on this shard in a loop we will have uneven data distribution across shard. We may want to execute the statement on destination shard in a loop.

That what I meant, and yes, this is not the brightest idea.

gleb-cloudius on 12 May 2021

Why must we evaluate the function multiple times? Shouldn't we be able to evaluate it once at the beginning and rewrite the term with the evaluation result?

dekimir on 12 May 2021

We need to pass the evaluation result for processing to a different shard and the current prepared statement is not suitable for that.

gleb-cloudius on 12 May 2021

Why must we evaluate the function multiple times? Shouldn't we be able to evaluate it once at the beginning and rewrite the term with the evaluation result?

Each shard has an own copy of the prepared statement to begin with. It would be a bad idea to rewrite the parsed tree during execution. But if we assign slots during semantical analysis in a deterministic way, it is a possible way to "share" evaluation results via client_info.

kostja on 12 May 2021

I'll keep this in mind as I evolve the AST. One of the goals is certainly to make it easier to share AST across shards.

dekimir on 12 May 2021

I'll keep this in mind as I evolve the AST. One of the goals is certainly to make it easier to share AST across shards.

That was always the right way to do what the hack we have now tries to do. The question is how we fix what we have now.

gleb-cloudius on 12 May 2021

The question is how we fix what we have now.

That would depend on how urgent this is. Can it wait until the AST evolves sufficiently?

dekimir on 12 May 2021

I bet fixing #7216 would make the right solution possible. Maybe I should give it top priority?

dekimir on 12 May 2021

@kostja who can decide about priority of this bug. How common is the usage of uuid() timeuuid() functions like this? Workaround seams simple - generate uuid on a client.

gleb-cloudius on 12 May 2021

Affected built-in functions:

currentTimestamp 
currentDate
currentTime 
currentTimeUUID
UUID

kostja on 12 May 2021

We could assign each non-deterministic function used in a partition key a dedicated slot in client_info, addressed by an integer. Then, any such function would have to fill that slot when invoked, and use that slot if it's not empty.

I do not think client_info is a good place. It is per connection, not per query. May be put it into query options.

gleb-cloudius on 13 May 2021

Workaround: use a parameter marker and generate the partition key on the client.
Overall this CQL is not possible to execute efficient: neither shard nor token aware driver can compute a token for this kind of statement, and thus it almost always run on a different shard.

kostja on 20 May 2021

👍1

Wow, nice bug.

Perhaps we can try to make things deterministic. When the query starts executing, set up state that includes the time (for now()) and the random number seed (perhaps they can be the same thing). When the query fails, return this state in the "bounce to shard" response, and when it is re-executed it will restart with the same state.

Unfortunately, it means adding this state as a parameter to all native functions so they can access and mutate it.

avikivity on 20 May 2021

Looks like my solution duplicates @kostja almost exactly. Maybe we're two copies of a deterministic function.

avikivity on 20 May 2021

Workaround: use a parameter marker and generate the partition key on the client.
Overall this CQL is not possible to execute efficient: neither shard nor token aware driver can compute a token for this kind of statement, and thus it almost always run on a different shard.

This is a good observation. Because of this observation, I wonder if we shouldn't just declare using "uuid()" et al. in the partition key an error and not bother to implement it at all. Because, as you said, asking the server to pick the random partition key will almost always lead to an extra hop that can be avoided if the client picks it. Another problem with the server picking the random partition key is that if the query fails, and the client retries the operation - it will pick a different partition key.

So I think we can just reject operations using these functions as the partition key, adding an explanation in the user message that the client should generate the random number or whatever itself.

nyh on 24 May 2021

I wish I could agree we can reject all work we can't do efficiently, however, users may want to run these statements even despite efficiency implications. E.g. when playing with the database from the command line and parameter markers are unavailable. Or when it's necessary to use the server-side timeuuid, not client-side timeuuid, for partition key.

kostja on 24 May 2021

On Mon, May 24, 2021 at 07:29:02AM -0700, Konstantin Osipov wrote:

I wish I could agree we can reject all work we can't do efficiently, however, users may want to run this statements even despite efficiency implications. E.g. when playing with the database from the command line and parameter markers are unavailable. Or when it's necessary to use the server-side timeuuid, not client-side timeuuid, for partition key.

Nadav touched here something interesting. What if a client uses
timeuuid() for a pk, but the operation fails to reach the quorum (only one
replica gets the write). The client will retry but will get different
key obviously. Later the failed write may be resurrected and there will
be two writes instead of one.

--
Gleb.

gleb-cloudius on 24 May 2021

On Mon, May 24, 2021, 16:48 Gleb Natapov @.*> wrote:

On Mon, May 24, 2021 at 07:29:02AM -0700, Konstantin Osipov wrote:

I wish I could agree we can reject all work we can't do efficiently,
however, users may want to run this statements even despite efficiency
implications. E.g. when playing with the database from the command line and
parameter markers are unavailable. Or when it's necessary to use the
server-side timeuuid, not client-side timeuuid, for partition key.

Nadav touched here something interesting. What if a client uses
timeuuid() for a pk, but the operation fails to reach the quorum (only one
replica gets the write). The client will retry but will get different
key obviously. Later the failed write may be resurrected and there will
be two writes instead of one.

That's true. Client side generated uuids and timeuuids are the only sane
way to do retries.

--
Gleb.

—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/8604#issuecomment-847093075,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAKADLKQ4UXKPCV6AHJKACDTPJRLLANCNFSM44JDXHSA
.

haaawk on 25 May 2021

This query is a bad idea, but bad ideas should be caught before they are made into public, documented, stable APIs. Not after.

avikivity on 25 May 2021

But it highlights something we'll have later in multi-partition queries.

avikivity on 25 May 2021

This query is a bad idea, but bad ideas should be caught before they are made into public, documented, stable APIs. Not after.

After I get the first "safe mode" restriction in, this can also become one of these safe mode restrictions - we will disable this "problematic" query unless the user actively disables this restriction.

nyh on 26 May 2021

Makes sense. Though I expect it's not a common mistake.

avikivity on 27 May 2021

Shlomi suggests we retry the random value until it fits. @avikivity would you be OK with that suggestion?

kostja on 2 Jun 2021

Was this page helpful?

0 / 5 - 0 ratings