Cockroach: sql: naive FK check breaks write pipelining

Created on 13 Mar 2020 · 9 comments · Source: cockroachdb/cockroach

When deleting a parent row which cascade-deletes some children, it appears that we don't take advantage of the fact that we've just deleted all the children in order to elide the check that the parent has no children and thus can be deleted.

This is bad not only because we do a useless read but, much worse, because that useless read overlaps with the query's previous writes (i.e. the deletes of the children), which means a stall in our write pipelining. Meaning, the superfluous scan gets a bunch of QueryIntent requests tagging along, which block on the replication of the previous delete requests.

I think this is kind of a biggie; I think such situations are common. I'm not simply speculating about this fact; I've seen it oh so painfully in the wild. Repro below.

I believe the problem comes from this code:
https://github.com/cockroachdb/cockroach/blame/c097a16427f65e9070991f062716d222ea5903fe/pkg/sql/row/deleter.go#L181
There we always add a scan for each FK, ignoring the fact that, just above, we've already dealt with the FKs that define cascading actions.
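To sketch what I mean by taking the cascades into account (simplified, hypothetical types — this is not the actual deleter.go code): when planning the post-delete FK existence checks, skip any inbound FK whose ON DELETE action is CASCADE, since the cascade has already deleted the referencing rows.

```go
package main

import "fmt"

// FKAction models a foreign-key delete action (simplified; not the real
// cockroachdb types).
type FKAction int

const (
	NoAction FKAction = iota
	Cascade
)

// FK models an inbound foreign-key reference to the table being deleted from.
type FK struct {
	Name   string
	Action FKAction
}

// checksToPlan returns the FKs that still need an existence-check scan after a
// DELETE. FKs with ON DELETE CASCADE were already handled by the cascade
// (their referencing rows are gone), so no scan is planned for them.
func checksToPlan(fks []FK) []FK {
	var checks []FK
	for _, fk := range fks {
		if fk.Action == Cascade {
			// Eliding this scan avoids the read that overlaps the
			// cascade's deletes and stalls the write pipeline.
			continue
		}
		checks = append(checks, fk)
	}
	return checks
}

func main() {
	fks := []FK{
		{Name: "fk_cascade", Action: Cascade},
		{Name: "fk_restrict", Action: NoAction},
	}
	for _, fk := range checksToPlan(fks) {
		fmt.Println(fk.Name) // prints only fk_restrict
	}
}
```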

@jordanlewis do you think we can patch something up? I know @RaduBerinde is planning on deleting all this code eventually, but I hear it's not imminent.

Repro below. Check for QueryIntent in the output trace. That's the sign of a pipeline stall.

CREATE TABLE parent (
    id INT8 NOT NULL DEFAULT unique_rowid(),
    CONSTRAINT "primary" PRIMARY KEY (id ASC),
    FAMILY "primary" (id)
);
CREATE TABLE child (
    serial_number_id INT8 NOT NULL,
    storage_node_id BYTES NOT NULL,
    CONSTRAINT "primary" PRIMARY KEY (serial_number_id ASC, storage_node_id ASC),
    CONSTRAINT fk_serial_number_id_ref_serial_numbers FOREIGN KEY (serial_number_id) REFERENCES parent(id) ON DELETE CASCADE,
    FAMILY "primary" (serial_number_id, storage_node_id)
);
INSERT INTO parent VALUES (1);
INSERT INTO child VALUES (1, 'node1'), (1, 'node2'), (1, 'node3');
SET tracing = ON;
DELETE FROM parent WHERE id = 1;
SET tracing = OFF;
SHOW TRACE FOR SESSION;
Labels: C-bug

All 9 comments

@rohany would you be able to take a look at this? Thanks!

(attempting to clarify what causes the pipeline stall)
When we perform this delete, we do these KV ops:

Del /Table/parent/1/1/0
CascadeScan /Table/child/1/{1-2} (looking for entries in child with id = 1)
CascadeScan /Table/child/1/1/"node1"{-/#} (scans for each full child row to delete)
CascadeScan /Table/child/1/1/"node2"{-/#} ...
CascadeScan /Table/child/1/1/"node3"{-/#} ...
Del /Table/child/1/1/"node1"/0 (actually delete each child row)
Del /Table/child/1/1/"node2"/0 ...
Del /Table/child/1/1/"node3"/0 ...
FKScan /Table/60/1/{1-2} (the check you point out here)

I don't think we can eliminate the check you point out because it is used to verify that we indeed did delete all the rows we think we did.

Do these CascadeScan operations interrupt the write pipeline or only this last FKScan?

I don't think we can eliminate the check you point out because it is used to verify that we indeed did delete all the rows we think we did.

How do you mean? Why wouldn't we have deleted all the rows that we think we have deleted?

Do these CascadeScan operations interrupt the write pipeline or only this last FKScan?

Only the last FKScan - because it overlaps with the previous writes. /Table/60 in that output is the child, right?

Only the last FKScan - because it overlaps with the previous writes. /Table/60 in that output is the child, right?

Yup, I forgot to replace the table ID.

How do you mean? Why wouldn't we have deleted all the rows that we think we have deleted?

I'm not sure how to interpret this code. In the case where we aren't cascading, the scan here actually performs correctness checks. If we do cascade, I _think_ the FK checks could be omitted, but they might be present to verify that the cascade deleted everything it should have? cc @BramGruneir if you remember writing this.

We don't need to check if a delete succeeds. We don't do that on actual DELETEs, why would we need it for cascades? I think it's simply a matter of plumbing the information of what we don't need to check.

Without that check, I still see some QueryIntents in the trace --

  output row: ['defaultdb']
  rows affected: 1
  Scan /Table/52/1/1{-/#}
  querying next range at /Table/52/1/1
  r33: sending batch 1 Scan to (n1,s1):1
  fetched: /parent/primary/1 -> NULL
  Del /Table/52/1/1/0
  cascading delete into table: 53 using index: 1
  CascadeScan /Table/53/1/{1-2}
  querying next range at /Table/53/1/1
  r34: sending batch 1 Scan to (n1,s1):1
  CascadeScan /Table/53/1/1/"node1"{-/#}
  CascadeScan /Table/53/1/1/"node2"{-/#}
  CascadeScan /Table/53/1/1/"node3"{-/#}
  querying next range at /Table/53/1/1/"node1"
  r34: sending batch 3 Scan to (n1,s1):1
  Del /Table/53/1/1/"node1"/0
  Del /Table/53/1/1/"node2"/0
  Del /Table/53/1/1/"node3"/0
  querying next range at /Table/53/1/1/"node1"/0
  r34: sending batch 3 Del to (n1,s1):1
  querying next range at /Table/52/1/1/0
  querying next range at /Table/53/1/1/"node1"/0
  r34: sending batch 1 EndTxn to (n1,s1):1
  querying next range at /Table/53/1/1/"node1"/0
  r33: sending batch 1 Del to (n1,s1):1
  r34: sending batch 3 QueryIntent to (n1,s1):1
  fast path completed
  rows affected: 1
  output row: ['defaultdb']
  rows affected: 1
  output row: ['defaultdb']
  rows affected: 1

Yeah, that makes sense, Radu.

Without that check, I still see some QueryIntents in the trace

QueryIntents sent with or after the EndTxn are OK. You need to verify writes at the end of the transaction, but the point is that you want to defer everything until then, so that the replication latency is hidden by the transaction doing other things.
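To make that distinction concrete, here's a toy model (plain Go, made-up names — not crdb code) that counts only the mid-transaction stalls, i.e. reads that overlap an in-flight pipelined write; the verification that happens at or after EndTxn is deliberately not counted, since its latency is hidden by commit.

```go
package main

import "fmt"

// Op is a toy model of a KV operation in a transaction. Writes are pipelined
// (they replicate asynchronously); a read overlapping an in-flight write has
// to attach QueryIntent requests and wait for that write to replicate.
type Op struct {
	Write bool
	Key   string
}

// midTxnStalls counts reads that overlap in-flight pipelined writes.
// Verifying whatever is still in flight at EndTxn is not counted: that wait
// happens anyway, but overlaps with commit, so its latency is hidden.
func midTxnStalls(ops []Op) int {
	inFlight := map[string]bool{}
	n := 0
	for _, op := range ops {
		if op.Write {
			inFlight[op.Key] = true
		} else if inFlight[op.Key] {
			n++                      // pipeline stall: the read waits for replication
			delete(inFlight, op.Key) // the wait effectively flushes that write
		}
	}
	return n
}

func main() {
	del := Op{Write: true, Key: "child/1/1"}
	fkScan := Op{Write: false, Key: "child/1/1"} // the superfluous FK check

	fmt.Println(midTxnStalls([]Op{del, fkScan})) // 1: the scan stalls mid-txn
	fmt.Println(midTxnStalls([]Op{del}))         // 0: only the hidden EndTxn wait remains
}
```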

Fixed by #46121
