Cockroach: sql: rollback to savepoint after schema change can lead to deadlock

Created on 17 Apr 2018 · 30 comments · Source: cockroachdb/cockroach

The following deadlocks:

begin; savepoint cockroach_restart; create table t(x int); rollback to savepoint cockroach_restart;
select * from t;

The reason why it deadlocks is that the create left an intent on something (probably a table name record); the intent will never commit because the txn's epoch is incremented by the rollback to savepoint, but the intent still exists. The select then tries to resolve that name in order to get a lease on that table, and that name resolution process uses a different txn, which gets blocked on the intent.
Funny enough, if you do another create after the rollback, that works - probably because all reads/writes are done using the main txn in that case.
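For intuition, here is a toy Python model of the failure mode (illustrative only, not CockroachDB code; the key names and the shape of the lock table are made up): the rollback bumps the epoch but leaves the intent behind, so a read by any *other* txn on the namespace key blocks, while the original txn can still read over its own intent.

```python
class Txn:
    def __init__(self, txn_id):
        self.id = txn_id
        self.epoch = 0

    def rollback_to_savepoint(self):
        # Bumps the epoch, but intents written under the old epoch stay behind.
        self.epoch += 1


class Store:
    def __init__(self):
        self.intents = {}  # key -> owning txn id

    def write(self, txn, key):
        self.intents[key] = txn.id

    def read(self, txn, key):
        owner = self.intents.get(key)
        if owner is None:
            return "not found"
        if owner == txn.id:
            return "ok"       # a txn can read over its own intent
        return "blocked"      # anyone else waits for the intent to resolve


store = Store()
ddl_txn = Txn("ddl")
store.write(ddl_txn, "/namespace/t")   # CREATE TABLE leaves an intent
ddl_txn.rollback_to_savepoint()        # epoch bump; the intent is NOT cleaned up

lease_txn = Txn("lease")               # name resolution runs in a different txn
print(store.read(lease_txn, "/namespace/t"))  # blocked -> the deadlock
print(store.read(ddl_txn, "/namespace/t"))    # ok -> why a second CREATE works
```

This also captures the observation above: a second create after the rollback goes through the main txn, which sees its own intent and doesn't block.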

Found while investigating #24785

One way to fix this is for transactions that perform schema changes to employ a policy that says that they don't get to keep intents from old epochs - or rather, whenever we would bump the epoch, we'd create a new txn instead. This would mean that they don't get to benefit from the "reservation" effect of leaving intents around, but who cares. The moral justification is that these transactions touched "special" keys.
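The first option can be sketched in Python (a hypothetical policy, not actual CockroachDB behavior; `performed_ddl` and `restart` are invented names): whenever a DDL-performing txn would bump its epoch, swap in a brand-new txn and eagerly clean up the old intents.

```python
import itertools

_ids = itertools.count()


class Txn:
    def __init__(self):
        self.id = next(_ids)
        self.epoch = 0
        self.performed_ddl = False


class Store:
    def __init__(self):
        self.intents = {}  # key -> owning txn id

    def write(self, txn, key):
        self.intents[key] = txn.id

    def drop_intents_of(self, txn):
        self.intents = {k: v for k, v in self.intents.items() if v != txn.id}


def restart(store, txn):
    """What would normally be an epoch bump (e.g. ROLLBACK TO SAVEPOINT)."""
    if txn.performed_ddl:
        # Policy: DDL txns don't keep intents from old epochs. Swap in a
        # brand-new txn and clean the old one's intents up eagerly.
        store.drop_intents_of(txn)
        return Txn()
    txn.epoch += 1  # ordinary txns keep their intents as a "reservation"
    return txn


store = Store()
txn = Txn()
store.write(txn, "/namespace/t")
txn.performed_ddl = True
txn = restart(store, txn)
print(store.intents)  # {} -- nothing left for a later read to block on
```

The cost, as noted above, is that such transactions lose the "reservation" effect of leaving intents around across restarts.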
Another way is to do all lease acquisitions and name resolution and such by using the transaction running the statement that caused the lease. But I don't like that at all; it goes against divorcing the table descriptor caching layer from SQL transactions which I think is generally a great thing.

Opinions?

cc @vivekmenezes @bdarnell

A-schema-changes C-bug S-2

All 30 comments

A similar deadlock happens when you create a table in an open transaction and run some commands that require that schema in other transactions.

Note to self: #24888 added a test that (when ported to 2.0) can be simplified if we fix this issue.

@vivekmenezes that's not a deadlock, is it? That's normal reader-blocks-on-writer behavior.

@andreimatei yeah you're right that's just requests blocking on an uncommitted intent. Please add the test in #24888 to 2.0 so we don't regress on it. Thanks!

@andreimatei this isn't 2.1, is it?

It's not the worst thing in the world. In any case I'll pass it along to @vivekmenezes, thanks for the ping.
Rereading this, I now don't understand why

begin; savepoint cockroach_restart; create table t(x int); rollback to savepoint cockroach_restart;
select * from t;

deadlocks, but

begin; savepoint cockroach_restart; create table t(x int); rollback to savepoint cockroach_restart;
create;

doesn't.

@andreimatei I've not studied this issue. It is my intention to move all schema change writes to commit time, which will also fix https://github.com/cockroachdb/cockroach/issues/20526

I've been holding off on that because of the worry about leases that keep renewing themselves pushing the schema change writer. But I do plan on doing that as soon as epoch-based table leases are merged, since epoch-based table leases will considerably reduce the read demands on the table descriptors.

Moving this to the backlog. This is still an extant issue.

See https://github.com/cockroachdb/cockroach/issues/44385#issuecomment-585454484 for another case of this deadlock.

@jordanlewis I'm not sure organizationally where this work falls but something needs to be done here.

The select then tries to resolve that name in order to get a lease on that table - and that name resolution process uses a different txn which gets blocked on the intent.

Is there any "real" (or important-to-preserve) reason why the name resolution process is using a different txn?

Another way is to do all lease acquisitions and name resolution and such by using the transaction running the statement that caused the lease. But I don't like that at all; it goes against divorcing the table descriptor caching layer from SQL transactions which I think is generally a great thing.

I think Andrei's preferred solution above is reasonable. There are risks of exposing uncommitted DDLs to caches if we use the user's txn.

One way to fix this is for transactions that perform schema changes to employ a policy that says that they don't get to keep intents from old epochs - or rather, whenever we would bump the epoch, we'd create a new txn instead.

It's not just leases for tables. The case in question here is the database cache. I'm not going to lie, the schema caches scare me and I'm hesitant to touch them when there's a cleaner approach.

All I can say is that there used to be problems when the statement's txn was used too :)
To me, it's conceptually much cleaner to have the schema resolution process only have a timestamp as the input, and not a transaction. Passing a transaction just because we need to realize that some writes by this same transaction need to be ignored sounds quite funky.

The schema cache was introduced after #44385 was discovered; however, we did introduce a new schema lookup. It seems scary that when this happens I permanently can't reference that database descriptor again in #44385.

https://github.com/cockroachdb/cockroach/blob/0cdbfd35c7922b9c58417509beda62591eb5a95e/pkg/sql/descriptor.go#L164

However, it is worth noting that since introducing the schema cache, https://github.com/cockroachdb/cockroach/pull/44967 seems to be flaking more often.

Hey @andreimatei, would it be possible to make this happen as part of your SAVEPOINT work this release? I talked with @ajwerner, who suggests that the right way to move forward is to have DDL-containing transactions that get their epoch bumped use a new transaction (which functionally drops all of their outstanding intents) in the conn executor state machine.

I talked with @ajwerner, who suggests that the right way to move forward is to have DDL-containing transactions that get their epoch bumped use a new transaction (which functionally drops all of their outstanding intents) in the conn executor state machine.

That's what the original message in this thread suggests too. But how is it related to savepoints? Nice try, but I'd rather let someone else do it. I can review the PR, though.

I thought it was related enough because you are already working heavily in the conn executor state machine - I figured it wouldn't be too much extra trouble to add this too. But maybe I figured wrong.

I played around a bit. There's a couple of weird things going on.
The first, which I find most troubling, is that we somehow seem to be leaving some lock behind even after we close the connection whose transaction I believe held that lock.
Repro:
client 1:

begin; savepoint a; create table t(); rollback to a;
select * from t; -> blocks

client 2:
select * from t; -> blocks

close client 1. Client 2 is still blocked! Any new client will also block. Only restarting the server seems to fix it. Closing the connection is supposed to roll back the respective txn, and it generally does.
@nvanbenschoten , this couldn't be that lock leak bug you found last night, could it?

The 2nd interesting thing is that depending on exactly how I issue the operations (or their timing?), I don't always get the deadlock.
For example, begin; savepoint a; create table t(); rollback to a; select * from t; deadlocks, but issuing every statement separately doesn't - the final select says "relation not found".

When the thing deadlocks, there's a TableCollection.getTableVersion() which is deadlocked. The stack looks like this

When the final select doesn't deadlock, its trace looks like this:

sql/resolver.go:300  [n1,client=127.0.0.1:47540,user=root] !!! planner.LookupObject: defaultdb.public.t
sql/logical_schema_accessors.go:82  [n1,client=127.0.0.1:47540,user=root] !!! LogicalSchemaAccessor: defaultdb.public.t
sql/physical_schema_accessors.go:310  [n1,client=127.0.0.1:47540,user=root] !!! CachedPhysicalAccessor: defaultdb.public.t
sql/table.go:285  [n1,client=127.0.0.1:47540,user=root] !!! TableCollection.getTableVersion: defaultdb.public.t
sql/physical_schema_accessors.go:176  [n1,client=127.0.0.1:47540,user=root] !!! UncachedPhysicalAccessor: defaultdb.public.t
sql/physical_schema_accessors.go:208  [n1,client=127.0.0.1:47540,user=root] !!! UncachedPhysical couldn't find name: defaultdb.public.t
sql/table.go:287  [n1,client=127.0.0.1:47540,user=root] !!! TableCollection.getTableVersion: defaultdb.public.t - done
sql/resolver.go:143  [n1,client=127.0.0.1:47540,user=root] !!! resolveExistingObjectImpl couldn't find table

I'll try to understand what's going on.

@nvanbenschoten , this couldn't be that lock leak bug you found last night, could it?

I don't see any evidence that points to that. The lock issue was related to lease transfers, which don't seem to be at play here. Also, I tested this with the fix and could still reproduce. Doesn't it seem more likely that whatever is deadlocking the single transaction on itself is also blocking the other one?

Doesn't it seem more likely that whatever is deadlocking the single transaction on itself is also blocking the other one?

Yes, but my concern is with the rollback that's supposed to happen on connection close. The canceling of the connection is supposed to get out of the deadlock by canceling various contexts, and then everything is supposed to be rolled back - and so other clients are supposed to be unblocked.
I think I've figured this part out, though. I think the crux of the matter is that, if the connection closing happens when the SQL txn is in the Aborted state (in this particular case, the canceling of some context causes a lease acquisition to fail, which moves the txn into the Aborted state), then we no longer properly roll back everything. We're supposed to be doing that by sending a nonRetriableError{IsCommit:True} event, which was supposed to cause a rollback from any state (by virtue of the IsCommit attribute), but we're not doing the right thing in the Aborted state - we're not handling the IsCommit part correctly.
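As a rough sketch of the state-machine bug described above (toy Python; the state and event names loosely mirror the description, not the actual conn executor code):

```python
def handle_event(state, event, is_commit, fixed=True):
    """Toy model of the conn executor state machine handling the
    nonRetriableError{IsCommit} event that connection close synthesizes."""
    if state == "Aborted":
        # The bug: the Aborted handler ignored IsCommit, so nothing was
        # rolled back and the dead txn's locks lingered until a restart.
        if fixed and is_commit:
            return "rollback"
        return "stay-aborted"
    return "rollback"


# Buggy behavior: closing a connection whose txn is Aborted leaks locks.
print(handle_event("Aborted", "nonRetriableError", is_commit=True, fixed=False))
# With the fix, IsCommit forces a full rollback from any state.
print(handle_event("Aborted", "nonRetriableError", is_commit=True))
```

This matches the repro above: once the canceled lease acquisition moved the txn to Aborted, the close path never rolled anything back, so other clients stayed blocked on the leftover locks.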

Fixing the connection close bug in #45835

Back to why we sometimes deadlock and sometimes don't, the difference seems to come down to LeaseManager.resolveName() blocking or not blocking on an intent (presumably on system.namespace). I don't understand why it doesn't always block yet.

When we don't deadlock because reading system.namespace doesn't block, the final select tries to resolve the table name twice (using LookupObjectID). The first time it does it through LeaseManager.resolveName(), which uses its own txn. I'm not sure yet why that doesn't block.
The 2nd time it attempts to resolve the name through UncachedPhysicalAccessor. That call is using the select's txn, and I wonder if that's all right. The call is here:
https://github.com/cockroachdb/cockroach/blob/100729320a3c44cc4563d31d1fe753f002d37d8b/pkg/sql/table.go#L380
@otan , could you please tell me what the deal is around here? I thought that the way things work is that once a txn has created some tables, those tables are resolved locally by the tableCollection without needing to go to the store. Is that not the case?

Not quite sure, I haven't done too much with table collections. I can try to uncover it tomorrow.

I'm trying to understand

https://github.com/cockroachdb/cockroach/blob/f5fe582dcec086122acbe39f93ca8ca2ef3d0739/pkg/sql/lease.go#L1476-L1485

which seems to be the reason we fall back to

https://github.com/cockroachdb/cockroach/blob/100729320a3c44cc4563d31d1fe753f002d37d8b/pkg/sql/table.go#L376-L381

It seems we have to fall back to the uncached accessor (with no txn context) because p.Tables() makes it look like the table has disappeared, but we want to see if it's actually gone. In that case, it does appear as if using the same txn for that is broken. I'm not sure if not using the txn is the behaviour we always want, though - i.e. does it break anything if we use a fresh transaction for this?

Oliver, don't worry about it if you've got nothing to do with this. I thought you had introduced that fallback since I saw your name on the blame. I'll bring it to the attention of @ajwerner when I get back to the office; he seems to want to own this zoo. The fallback is only peripheral to this issue anyway.

I've figured out why the sequence doesn't always deadlock. If the writes to the namespace table happen to run into the tscache, then the respective intents will be written at a pushed timestamp.
Resolving names, acquiring leases, and all that, however, always uses the triggering txn's read timestamp for its reads, so when there's no uncertainty in play, it'll ignore the intent.
When running things by hand it's easy to run into the ts cache due to the 3s closed timestamp target.
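The timestamp interplay can be condensed into a two-line check (a simplification that ignores uncertainty intervals and intent resolution details; the numbers are arbitrary):

```python
def read_blocks_on_intent(intent_ts, read_ts):
    # MVCC rule: a read only conflicts with an intent at or below its read
    # timestamp. If the tscache pushed the intent above the reader's
    # timestamp, the reader skips the intent entirely (absent uncertainty).
    return intent_ts <= read_ts


pushed_intent_ts = 10  # the CREATE's namespace write ran into the tscache
print(read_blocks_on_intent(pushed_intent_ts, read_ts=5))   # False: no deadlock
print(read_blocks_on_intent(pushed_intent_ts, read_ts=15))  # True: deadlock
```

This is why running the statements by hand, with the 3s closed timestamp target pushing the writes, sometimes produces "relation not found" instead of a hang.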

I've implemented the fix for the deadlock that we were talking about: having the connExecutor transparently switch the KV txn to a new one in certain cases. It's not much code, but it's kinda unfortunate and there's no great place to put it. It makes me a bit unhappy. I'm now thinking that perhaps it'd be better to deal with this problem in KV and have a feature there that says that locks on a particular span of keys must be released when restarting or rolling back to savepoints. I want to discuss it with my peeps.
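The KV-level alternative could look roughly like this (purely hypothetical sketch; the span bounds and the function are invented, and real lock keys are binary, not strings):

```python
SPECIAL_SPAN = ("/system/", "/system0")  # hypothetical span of "special" keys


def release_special_locks(locks, span=SPECIAL_SPAN):
    """On a restart or ROLLBACK TO SAVEPOINT, eagerly release any locks
    that fall within the special span; all other locks are kept as usual."""
    lo, hi = span
    return [k for k in locks if not (lo <= k < hi)]


locks = ["/system/namespace/t", "/user/table/1/row"]
print(release_special_locks(locks))  # ['/user/table/1/row']
```

The appeal is that ordinary transactions keep their intent "reservations", while locks on schema metadata can't outlive the epoch that wrote them.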

Interesting, I can still repro https://github.com/cockroachdb/cockroach/issues/44385 pretty easily with this change in -- now I don't even have to run the "SHOW TABLES" query to get the deadlock.

repro:

# screen 1
./cockroach start-single-node --listen-addr=localhost:26257 --insecure
# screen 2
./cockroach sql --watch '1ms' -e 'DROP DATABASE IF EXISTS db0;CREATE DATABASE db0;USE db0;CREATE TABLE t0(c0 INT);' --insecure

edit: nvm they just take a minute now:


CREATE TABLE

Time: 46.067ms

CREATE TABLE

Time: 68.928ms

CREATE TABLE

Time: 43.123171s

CREATE TABLE

Time: 1m0.149811s

CREATE TABLE

Time: 1m0.026241s

CREATE TABLE

Time: 47.636ms

CREATE TABLE

Time: 47.969ms

CREATE TABLE

Time: 59.962868s

I don't know what's up with the delay. Mind filing it separately for Lucy or Werner?

