cqlsh> create table ks.test(cartId UUID, marker BOOLEAN STATIC, itemId UUID, count INT, PRIMARY KEY(cartId, itemId));
cqlsh> insert into ks.test(cartId, marker) values(uuid(), true) if not exists;
ServerError: std::get: wrong index for variant
Insert without if not exists works perfectly fine.
FYI @kostja
I can reproduce it in dev mode, but in debug mode the query succeeds. :thinking:
(edit: I can also reproduce it in debug mode, as long as I run it from the command line; just not in gdb :cry: )
It only happens when the value inserted is uuid():
cqlsh:ks> insert into ks.test(cartId, marker) values(uuid(), true) if not exists;
ServerError: std::get: wrong index for variant
cqlsh:ks> insert into ks.test(cartId, marker) values(1990460f-64e0-4cc2-b834-e3cfa98741f0, true) if not exists;
[applied] | cartid | itemid | marker | count
-----------+--------+--------+--------+-------
True | null | null | null | null
I believe this is the std::get that throws, based on asserts I've sprinkled around:
https://github.com/scylladb/scylla/blob/cee4c075d24129788dacb75f9868f1e2a16b02ae/transport/server.cc#L836
@gleb-cloudius looks like you added the "has to be foreign ptr" comment; would you mind taking a look at this?
First of all I can confirm that this is the get that throws. But then things looks bad. The logic here is that we have a function that can either return a result or a shard number the query should be processed on (based on the taken). If a shard number is returned the code jumps to that shard and expect result to be returned this time, but instead we again getting redirected to a different shard.
Weird.
The problem is that the cql statement is not deterministic and will produce different pk each time it is evaluated. It means after re-executing the statement on the destination shard we may get different pk and request to bonce the request to another shard again.
I honestly do not see any easy solution to that, if at all (short of trying to execute the statement in a loop on the same shard hoping that it will eventually succeed).
If we execute such statements on this shard in a loop we will have uneven data distribution across shards. We may want to execute the statement on destination shard in a loop. But that may have unpredictable performance implication, including performance dropping to 1/(n_shards) of the normal one, since it will take n_shards attempts on average to re-generate a new random value that falls on the same shard.
We could assign each non-deterministic function used in a partition key a dedicated slot in client_info, addressed by an integer. Then, any such function would have to fill that slot when invoked, and use that slot if it's not empty.
That would require mark-up of the expression tree during semantic analysis pass, which we're lacking (both the pass and the markup).
If we execute such statements on this shard in a loop we will have uneven data distribution across shard. We may want to execute the statement on destination shard in a loop.
That what I meant, and yes, this is not the brightest idea.
Why must we evaluate the function multiple times? Shouldn't we be able to evaluate it once at the beginning and rewrite the term with the evaluation result?
We need to pass the evaluation result for processing to a different shard and the current prepared statement is not suitable for that.
Why must we evaluate the function multiple times? Shouldn't we be able to evaluate it once at the beginning and rewrite the term with the evaluation result?
Each shard has an own copy of the prepared statement to begin with. It would be a bad idea to rewrite the parsed tree during execution. But if we assign slots during semantical analysis in a deterministic way, it is a possible way to "share" evaluation results via client_info.
I'll keep this in mind as I evolve the AST. One of the goals is certainly to make it easier to share AST across shards.
I'll keep this in mind as I evolve the AST. One of the goals is certainly to make it easier to share AST across shards.
That was always the right way to do what the hack we have now tries to do. The question is how we fix what we have now.
The question is how we fix what we have now.
That would depend on how urgent this is. Can it wait until the AST evolves sufficiently?
I bet fixing #7216 would make the right solution possible. Maybe I should give it top priority?
@kostja who can decide about priority of this bug. How common is the usage of uuid() timeuuid() functions like this? Workaround seams simple - generate uuid on a client.
Affected built-in functions:
currentTimestamp
currentDate
currentTime
currentTimeUUID
UUID
We could assign each non-deterministic function used in a partition key a dedicated slot in client_info, addressed by an integer. Then, any such function would have to fill that slot when invoked, and use that slot if it's not empty.
I do not think client_info is a good place. It is per connection, not per query. May be put it into query options.
Workaround: use a parameter marker and generate the partition key on the client.
Overall this CQL is not possible to execute efficient: neither shard nor token aware driver can compute a token for this kind of statement, and thus it almost always run on a different shard.
Wow, nice bug.
Perhaps we can try to make things deterministic. When the query starts executing, set up state that includes the time (for now()) and the random number seed (perhaps they can be the same thing). When the query fails, return this state in the "bounce to shard" response, and when it is re-executed it will restart with the same state.
Unfortunately, it means adding this state as a parameter to all native functions so they can access and mutate it.
Looks like my solution duplicates @kostja almost exactly. Maybe we're two copies of a deterministic function.
Workaround: use a parameter marker and generate the partition key on the client.
Overall this CQL is not possible to execute efficient: neither shard nor token aware driver can compute a token for this kind of statement, and thus it almost always run on a different shard.
This is a good observation. Because of this observation, I wonder if we shouldn't just declare using "uuid()" et al. in the partition key an error and not bother to implement it at all. Because, as you said, asking the server to pick the random partition key will almost always lead to an extra hop that can be avoided if the client picks it. Another problem with the server picking the random partition key is that if the query fails, and the client retries the operation - it will pick a different partition key.
So I think we can just reject operations using these functions as the partition key, adding an explanation in the user message that the client should generate the random number or whatever itself.
I wish I could agree we can reject all work we can't do efficiently, however, users may want to run these statements even despite efficiency implications. E.g. when playing with the database from the command line and parameter markers are unavailable. Or when it's necessary to use the server-side timeuuid, not client-side timeuuid, for partition key.
On Mon, May 24, 2021 at 07:29:02AM -0700, Konstantin Osipov wrote:
I wish I could agree we can reject all work we can't do efficiently, however, users may want to run this statements even despite efficiency implications. E.g. when playing with the database from the command line and parameter markers are unavailable. Or when it's necessary to use the server-side timeuuid, not client-side timeuuid, for partition key.
Nadav touched here something interesting. What if a client uses
timeuuid() for a pk, but the operation fails to reach the quorum (only one
replica gets the write). The client will retry but will get different
key obviously. Later the failed write may be resurrected and there will
be two writes instead of one.
--
Gleb.
On Mon, May 24, 2021, 16:48 Gleb Natapov @.*> wrote:
On Mon, May 24, 2021 at 07:29:02AM -0700, Konstantin Osipov wrote:
I wish I could agree we can reject all work we can't do efficiently,
however, users may want to run this statements even despite efficiency
implications. E.g. when playing with the database from the command line and
parameter markers are unavailable. Or when it's necessary to use the
server-side timeuuid, not client-side timeuuid, for partition key.Nadav touched here something interesting. What if a client uses
timeuuid() for a pk, but the operation fails to reach the quorum (only one
replica gets the write). The client will retry but will get different
key obviously. Later the failed write may be resurrected and there will
be two writes instead of one.
That's true. Client side generated uuids and timeuuids are the only sane
way to do retries.
--
Gleb.—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/8604#issuecomment-847093075,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAKADLKQ4UXKPCV6AHJKACDTPJRLLANCNFSM44JDXHSA
.
This query is a bad idea, but bad ideas should be caught before they are made into public, documented, stable APIs. Not after.
But it highlights something we'll have later in multi-partition queries.
This query is a bad idea, but bad ideas should be caught before they are made into public, documented, stable APIs. Not after.
After I get the first "safe mode" restriction in, this can also become one of these safe mode restrictions - we will disable this "problematic" query unless the user actively disables this restriction.
Makes sense. Though I expect it's not a common mistake.
Shlomi suggests we retry the random value until it fits. @avikivity would you be OK with that suggestion?