Cockroach: sql: v20.1.0: read table descriptor without ModificationTime with zero MVCC timestamp

Created on 19 May 2020  路  17Comments  路  Source: cockroachdb/cockroach

This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.

Sentry link: https://sentry.io/organizations/cockroach-labs/issues/1675228652/?referrer=webhooks_plugin

Panic message:

structured.go:3478: read table descriptor for %q (%d.%d) without ModificationTime with zero MVCC timestamp | string; sqlbase.ID; sqlbase.ID


Stacktrace (expand for inline code snippets):

https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/sqlbase/structured.go#L3477-L3479 in pkg/sql/sqlbase.(Descriptor).Table
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/sqlbase/structured.go#L407-L409 in pkg/sql/sqlbase.getTableDescFromIDRaw
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/sqlbase/structured.go#L988-L990 in pkg/sql/sqlbase.maybeUpgradeForeignKeyRepOnIndex
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/sqlbase/structured.go#L955-L957 in pkg/sql/sqlbase.(
TableDescriptor).MaybeUpgradeForeignKeyRepresentation
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/sqlbase/structured.go#L892-L894 in pkg/sql/sqlbase.(TableDescriptor).MaybeFillInDescriptor
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/sqlbase/structured.go#L365-L367 in pkg/sql/sqlbase.GetTableDescFromID
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/sqlbase/structured.go#L421-L423 in pkg/sql/sqlbase.GetMutableTableDescFromID
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/table.go#L480-L482 in pkg/sql.(
TableCollection).getMutableTableVersionByID
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/drop_table.go#L489-L491 in pkg/sql.(planner).removeFKBackReference
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/drop_table.go#L251-L253 in pkg/sql.(
planner).dropTableImpl
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/drop_database.go#L196-L198 in pkg/sql.(dropDatabaseNode).startExec
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/plan.go#L394-L396 in pkg/sql.startExec.func2
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/walk.go#L143-L145 in pkg/sql.(
planVisitor).visitInternal.func1
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/walk.go#L720-L722 in pkg/sql.(planVisitor).visitInternal
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/walk.go#L110-L112 in pkg/sql.(
planVisitor).visit
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/walk.go#L74-L76 in pkg/sql.walkPlan
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/plan.go#L397-L399 in pkg/sql.startExec
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/plan_node_to_row_source.go#L116-L118 in pkg/sql.(planNodeToRowSource).Start
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/execinfra/processorsbase.go#L747-L749 in pkg/sql/execinfra.(
ProcessorBase).Run
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/flowinfra/flow.go#L369-L371 in pkg/sql/flowinfra.(FlowBase).Run
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/distsql_running.go#L404-L406 in pkg/sql.(
DistSQLPlanner).Run
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/distsql_running.go#L1018-L1020 in pkg/sql.(DistSQLPlanner).PlanAndRun
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/conn_executor_exec.go#L882-L884 in pkg/sql.(
connExecutor).execWithDistSQLEngine
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/conn_executor_exec.go#L775-L777 in pkg/sql.(connExecutor).dispatchToExecutionEngine
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/conn_executor_exec.go#L480-L482 in pkg/sql.(
connExecutor).execStmtInOpenState
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/conn_executor_exec.go#L95-L97 in pkg/sql.(connExecutor).execStmt
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/conn_executor.go#L1367-L1369 in pkg/sql.(
connExecutor).execCmd
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/conn_executor.go#L1296-L1298 in pkg/sql.(connExecutor).run
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/conn_executor.go#L475-L477 in pkg/sql.(
Server).ServeConn
https://github.com/cockroachdb/cockroach/blob/9d456b9ec82cbf9a740a092c0d9f56da48779689/pkg/sql/pgwire/conn.go#L589-L591 in pkg/sql/pgwire.(*conn).processCommandsAsync.func1

pkg/sql/sqlbase/structured.go in pkg/sql/sqlbase.(*Descriptor).Table at line 3478
pkg/sql/sqlbase/structured.go in pkg/sql/sqlbase.getTableDescFromIDRaw at line 408
pkg/sql/sqlbase/structured.go in pkg/sql/sqlbase.maybeUpgradeForeignKeyRepOnIndex at line 989
pkg/sql/sqlbase/structured.go in pkg/sql/sqlbase.(*TableDescriptor).MaybeUpgradeForeignKeyRepresentation at line 956
pkg/sql/sqlbase/structured.go in pkg/sql/sqlbase.(*TableDescriptor).MaybeFillInDescriptor at line 893
pkg/sql/sqlbase/structured.go in pkg/sql/sqlbase.GetTableDescFromID at line 366
pkg/sql/sqlbase/structured.go in pkg/sql/sqlbase.GetMutableTableDescFromID at line 422
pkg/sql/table.go in pkg/sql.(*TableCollection).getMutableTableVersionByID at line 481
pkg/sql/drop_table.go in pkg/sql.(*planner).removeFKBackReference at line 490
pkg/sql/drop_table.go in pkg/sql.(*planner).dropTableImpl at line 252
pkg/sql/drop_database.go in pkg/sql.(*dropDatabaseNode).startExec at line 197
pkg/sql/plan.go in pkg/sql.startExec.func2 at line 395
pkg/sql/walk.go in pkg/sql.(*planVisitor).visitInternal.func1 at line 144
pkg/sql/walk.go in pkg/sql.(*planVisitor).visitInternal at line 721
pkg/sql/walk.go in pkg/sql.(*planVisitor).visit at line 111
pkg/sql/walk.go in pkg/sql.walkPlan at line 75
pkg/sql/plan.go in pkg/sql.startExec at line 398
pkg/sql/plan_node_to_row_source.go in pkg/sql.(*planNodeToRowSource).Start at line 117
pkg/sql/execinfra/processorsbase.go in pkg/sql/execinfra.(*ProcessorBase).Run at line 748
pkg/sql/flowinfra/flow.go in pkg/sql/flowinfra.(*FlowBase).Run at line 370
pkg/sql/distsql_running.go in pkg/sql.(*DistSQLPlanner).Run at line 405
pkg/sql/distsql_running.go in pkg/sql.(*DistSQLPlanner).PlanAndRun at line 1019
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).execWithDistSQLEngine at line 883
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).dispatchToExecutionEngine at line 776
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).execStmtInOpenState at line 481
pkg/sql/conn_executor_exec.go in pkg/sql.(*connExecutor).execStmt at line 96
pkg/sql/conn_executor.go in pkg/sql.(*connExecutor).execCmd at line 1368
pkg/sql/conn_executor.go in pkg/sql.(*connExecutor).run at line 1297
pkg/sql/conn_executor.go in pkg/sql.(*Server).ServeConn at line 476
pkg/sql/pgwire/conn.go in pkg/sql/pgwire.(*conn).processCommandsAsync.func1 at line 590

| Tag | Value |
|---|---|
| Cockroach Release | v20.1.0 |
| Cockroach SHA: | 9d456b9ec82cbf9a740a092c0d9f56da48779689 |
| Platform | darwin amd64 |
| Distribution | CCL |
| Environment | v20.1.0 |
| Command | server |
| Go Version | go1.13.9|
| # of CPUs |8 |
| # of Goroutines |271 |

C-bug O-sentry

All 17 comments

cc @lucy-zhang

@ajwerner do you know how this might be possible? This is while fetching a referenced table while migrating the FK representation for a table, though I don't see how that would really matter.

I have literally no clue how this could happen. It only seems possible if the returned KV pair returned a zero-value hlc timestamp. This code path always uses a transaction.

In effect, it seems we called GetProtoTs and got a zero-valued hlc.Timestamp and a descriptor that had a non-nil Table in its union. There's something fishy here.

Also worth noting that this is a mac. We should probably log.Safe some of these arguments.

One other thing of note is that this is the same cluster that hit https://github.com/cockroachdb/cockroach/issues/48786 - it'd be interesting to know what is going on in terms of the state of that cluster.

@daylerees @davidsbond @rcrowe any chance you observed a panic around this time? I'd love to understand what happened here as it certainly violates my expectations.

Also surprising to me is that these seem to be the same clusters but the original issue was using a linux binary and this one seems to be using a macOS binary.

Hi all, our database cluster that was affected by this is using the standard docker image distribution, not 100% sure how we'd end up with a macOS binary. I'm just checking our DB logs to see if it is our cluster

Grepping through the logs of the cluster that was affected is returning no results. We also restored that database from backup as was suggested

Is it possible that while we had the same error the sentry report came from a different cluster to our own?

We seem to finally have a lead on this bug. The problem is that getTableDescFromIDRaw, which is called for every table with a FK relationship to the original table in MaybeUpgradeForeignKeyRepresentation, always reads table descriptors from the store. If the other table descriptors have already been modified in the same transaction, we'll read the intent, and the timestamp returned from GetProtoTs will be zero, violating the assumptions of MaybeSetModificationTimeFromMVCCTimestamp.

A potential similar problem occurs with TableDescriptor.Validate(), which needs to fetch other table descriptors to validate cross-references.

The current thinking is that we should stop reading these descriptors directly and instead always go through a desc.Collection. Details still need to be worked out.

In the short term, the fix which we can backport is to not enforce this invariant for tables retrieved just for the purpose of this upgrade.

We seem to finally have a lead on this bug. The problem is that getTableDescFromIDRaw, which is called for every table with a FK relationship to the original table in MaybeUpgradeForeignKeyRepresentation, always reads table descriptors from the store. If the other table descriptors have already been modified in the same transaction, we'll read the intent, and the timestamp returned from GetProtoTs will be zero, violating the assumptions of MaybeSetModificationTimeFromMVCCTimestamp.

Something occurred to me that this came up at some point, esp in reference to the changes to expose MVCC timestamps to sql. While testing the proposed fix I figured it might be worth ensuring that the concerning behavior actually happens. I wrote this test and it does not find this zero-timestamp behavior. I ran it both on master and release-20.1. I think we still need to understand where the zero comes from.

// TestGetIntentTimestamp tests that a read against an intent returns the value
// with a non-zero timestamp.
func TestGetIntentTimestamp(t *testing.T) {
    defer leaktest.AfterTest(t)()
    s, _, db := serverutils.StartServer(t, base.TestServerArgs{Insecure: true})
    defer s.Stopper().Stop(context.Background())
    ctx := context.Background()

    require.NoError(t, db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
        const key = "aa"
        // Using ptpb.Record as a simple proto message.
        ptRec := ptpb.Record{ID: uuid.MakeV4()}
        if err := txn.Put(context.Background(), key, &ptRec); err != nil {
            t.Fatal(err)
        }
        // Let's ensure that when we read our intent, we read it at a non-zero
        // timestamp.
        var read ptpb.Record
        ts, err := txn.GetProtoTs(ctx, key, &read)
        if err != nil {
            return err
        }
        require.NotZero(t, ts)
        require.EqualValues(t, ptRec, read)
        return nil
    }))
}
$ go test ./pkg/kv/ --run TestGetIntentTimestamp
ok      github.com/cockroachdb/cockroach/pkg/kv 0.093s

It's worth noting that this behavior also is not good. It means we'll populate descriptors with ModificationTime as the provisional commit timestamp rather than the real commit timestamp. In practice this is probably fine because we're not ever handing these descriptors out for leases.

Here's a contrived way to reproduce a zero timestamp, which still seems not too dissimilar to the MaybeUpgradeForeignKeyRepresentation case. One interesting thing is that this doesn't always reproduce the panic; it usually does, but sometimes it succeeds without an error. I couldn't repro a success with -vmodule=replica_evaluate=2, though.

Replace the call to ValidateTable() with Validate() in
https://github.com/cockroachdb/cockroach/blob/cb0ed20357024dabfd070487fb442279cbbcf453/pkg/sql/catalog/catalogkv/catalogkv.go#L163
to start reading referenced descriptors in GetMutableDescriptorByID. Then the following usually (but not always) causes a panic due to reading a zero timestamp:

create table t1 (a int primary key);
create table t2 (b int primary key);
create table t3 (a int references t1, b int references t2);
truncate t1 cascade;

Here's the relevant part of the stack trace:

github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xa9bf800, 0x7cac095, 0x11, 0xc000ff3e20)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8
github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0xa9bee60, 0x4, 0x162423d5f5731b70, 0x602, 0xa5bff26, 0x19, 0xfaf, 0xc00077ba40, 0x55, 0x0, ...)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:246 +0xa7e
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x86e9c00, 0xc000074118, 0xc000000004, 0x2, 0x7d6fa4e, 0x53, 0xc000ff4340, 0x2, 0x2)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:52 +0x1ac
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x86e9c00, 0xc000074118, 0x1, 0x4, 0x7d6fa4e, 0x53, 0xc000ff4340, 0x2, 0x2)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:44 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:153
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*Descriptor).MaybeSetModificationTimeFromMVCCTimestamp(0xc000479140, 0x86e9c00, 0xc000074118, 0x0, 0x0)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/structured.go:4015 +0x666
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*Descriptor).Table(0xc000479140, 0x0, 0xc000000000, 0x7a2d820)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/structured.go:4034 +0x8f
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.getTableDescFromIDRaw(0x86e9bc0, 0xc00086c400, 0x866a6e0, 0xc0014e0ab0, 0xc000c957e0, 0xc000c957e0, 0x0, 0x37, 0x10, 0xc000ff4588, ...)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/structured.go:453 +0x103
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.GetTableDescFromID(0x86e9bc0, 0xc00086c400, 0x866a6e0, 0xc0014e0ab0, 0xc000c957e0, 0xc000c957e0, 0x0, 0x37, 0x1, 0x0, ...)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/structured.go:406 +0x87
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*TableDescriptor).validateCrossReferences.func1(0x37, 0xc000ff4948, 0xc000000036, 0xc000ff4b80)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/structured.go:1791 +0x17f
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*TableDescriptor).validateCrossReferences(0xc001524000, 0x86e9bc0, 0xc00086c400, 0xc0014e0ab0, 0xc000c957e0, 0xc000c957e0, 0x0, 0x0, 0x0)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/structured.go:1836 +0x885
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*TableDescriptor).Validate(0xc001524000, 0x86e9bc0, 0xc00086c400, 0xc0014e0ab0, 0xc000c957e0, 0xc000c957e0, 0x0, 0x0, 0x0)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/structured.go:1765 +0xb0
github.com/cockroachdb/cockroach/pkg/sql/catalog/catalogkv.unwrapDescriptorMutable(0x86e9bc0, 0xc00086c400, 0xc0014e0ab0, 0xc000c957e0, 0xc000c957e0, 0x0, 0x162423d5f104c8b8, 0x162423d500000000, 0xc0004790a0, 0x0, ...)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/catalogkv/catalogkv.go:163 +0x41a
github.com/cockroachdb/cockroach/pkg/sql/catalog/catalogkv.GetMutableDescriptorByID(0x86e9bc0, 0xc00086c400, 0xc0014e0ab0, 0xc000c957e0, 0xc000c957e0, 0x0, 0x36, 0xc000ff5488, 0x1, 0x1, ...)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/catalogkv/catalogkv.go:147 +0x236
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).getMutableDescriptorByID(0xc001542228, 0x86e9bc0, 0xc00086c400, 0x36, 0xc0014e0ab0, 0x15, 0xc000ff55a0, 0x1, 0x1)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/collection.go:521 +0x14f
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).GetMutableTableVersionByID(0xc001542228, 0x86e9bc0, 0xc00086c400, 0xc000000036, 0xc0014e0ab0, 0x15, 0xc000ff55a0, 0x1)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/collection.go:505 +0x55
github.com/cockroachdb/cockroach/pkg/sql.(*planner).findAllReferences(0xc0015423d8, 0x86e9bc0, 0xc00086c400, 0xc000770c76, 0x2, 0x300000037, 0x0, 0x0, 0x0, 0x0, ...)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/sql/truncate.go:308 +0x26f
github.com/cockroachdb/cockroach/pkg/sql.(*planner).truncateTable(0xc0015423d8, 0x86e9bc0, 0xc00086c400, 0xc000000037, 0xc000f82780, 0x25, 0x100000000, 0x793b8e0, 0xc001310120)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/sql/truncate.go:214 +0x4fd
github.com/cockroachdb/cockroach/pkg/sql.(*truncateNode).startExec(0xc000010e90, 0x86e9bc0, 0xc00086c400, 0xc000c53b80, 0xc0015423d8, 0x6767800, 0xc0014c8100)
    /Users/lucy/go/src/github.com/cockroachdb/cockroach/pkg/sql/truncate.go:131 +0x981

Starting off with 53 -> t1, 54 -> t2, 55 -> t3, the sequence of writes is:

  • put table 53 (t1) in the drop state
  • update references on table 55 (t3) to point to 56 (the new t1)
  • put table 55 (t1) in the drop state

Then we start trying to determine the dependencies of 55 (t3) so we can determine which ones to also truncate, in planner.findAllReferences. That reads table 54 via GetMutableDescriptorByID, which then calls Validate (with our patch), which fetches table 55 again, which causes the panic.

During TRUNCATE, we don't delete any of these descriptors right away (and, obviously, if they were deleted then we wouldn't get values for them). I don't have any insights into what's going on here.

Alright, the problem is pretty clear now, when reading an intent we don't get a timestamp. I'll write up a patch to fix it. The problem ends up being that we are reading a key we've already written to so avoiding doing the re-lookup would save us but nevertheless, let's just fix the bug.

Wait, can you clarify what the the difference is between what you just described and the non-repro from earlier?

Give me a second to post a PR. It has to do with sequence numbers and intent history.

Was this page helpful?
0 / 5 - 0 ratings