Cockroach: sql: column backfills should build a new index instead of mutating the existing one

Created on 23 Apr 2020 · 9 comments · Source: cockroachdb/cockroach

Currently, when a backfill is needed for a column being added or dropped, the column backfiller rewrites the existing primary index in-place to add or remove the column value. This has some disadvantages compared to the implementation of index backfills (for non-interleaved indexes):

  • The index backfiller is much faster because it batches KVs and writes SSTs directly.
  • Column backfills are more difficult to clean up. An index being built has no effect on other indexes and can be GC'ed using ClearRanges (which is what the index GC job currently does). A column backfill can only be cleaned up by running the "reverse" column backfill.

    • This poses problems for restoring backups taken in the middle of a column backfill (#44019). Currently the best solution is to create a new job to resume the column schema change, but it would be good to have the option of simply ignoring the column.

    • We have bugs that prevent column backfills from being completely rolled back (#47712). In the worst case, it would be good to have the option of ignoring the partially backfilled column and clearing the mutation. Currently, the presence of column values with no accompanying column mutation on the table descriptor might make the KVs unreadable.

  • Dropping a column requires deleting values. If we have to roll back dropping a column and want to recover the previous values (which we currently don't do; see #46541), the only way to get them back is via MVCC.

The proposal is to change the implementation of column backfills to build a new (primary) index every time, instead of mutating the existing one. There are 2 proposed steps in this change (which are somewhat independent):

  1. Switch the column backfill implementation so that it works like today's index backfiller. Adding or dropping any column would be like a primary key change: we go through the states of adding a new index until we're writing to both indexes, then do an index backfill, then swap the primary index in the final table descriptor update of the schema change.
  2. There are performance concerns with just doing the first step. A proposed optimization: Reorder the steps in adding an index so that the backfill happens before any live traffic to the new index is turned on, then start writing to the new index and use a rangefeed to catch up on writes in the gap between the backfill and when all nodes are writing to the new index. See #45888 and #36850 for details. We'd want to do some speed-of-light experiments before committing to this approach.
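Step 1's state progression can be sketched as follows. This is an illustrative sketch in Go, not CockroachDB's actual descriptor API: the state names mirror the stages a new index moves through during today's primary key changes, which the proposal reuses for column adds and drops.

```go
package main

import "fmt"

// indexState models the stages a new primary index steps through when a
// column add/drop is executed as a primary-key-style change (hypothetical
// names; the real implementation lives in the table descriptor mutations).
type indexState int

const (
	stateDeleteOnly  indexState = iota // new index applies deletes only
	stateWriteOnly                     // all writes go to both old and new index
	stateBackfilling                   // index backfiller ingests SSTs
	statePublicSwap                    // new index swapped in as the primary
)

func (s indexState) String() string {
	return [...]string{"DELETE_ONLY", "WRITE_ONLY", "BACKFILLING", "PUBLIC (swapped)"}[s]
}

// progression lists the descriptor versions a column change would step
// through under the proposal, identical to a primary key change.
func progression() []indexState {
	return []indexState{stateDeleteOnly, stateWriteOnly, stateBackfilling, statePublicSwap}
}

func main() {
	for i, s := range progression() {
		fmt.Printf("version %d: new primary index is %s\n", i+1, s)
	}
}
```

The optimization in step 2 would reorder this so the backfill happens before `stateWriteOnly`, with a rangefeed catching up the gap.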

Rolling back a column schema change, with this proposed approach, would essentially mean switching back to the pre-existing index. If we rolled back dropping a column, we still have all the column data as of the timestamp when the schema change was started.

This approach would also allow us to do ALTER TYPE in a single index backfill operation (see #46893).

Open questions:

  • What do we do for interleaved tables? Very preliminary thoughts:

    • For interleaved tables with no interleaved children, we should just be able to write the new primary index alongside the old one. I'm not sure what the implications are for cleaning up an index if we don't want to/can't roll it back normally. Presumably we can't just leave abandoned index KVs interleaved in some other table with no indication of this on any of the table descriptors.

    • For interleaved parents, can we do this without rewriting all the child interleaved indexes as well?

A-schema-changes

All 9 comments

Big 👍.

> Rolling back a column schema change, with this proposed approach, would essentially mean switching back to the pre-existing index. If we rolled back dropping a column, we still have all the column data as of the timestamp when the schema change was started.

Out of curiosity, when do we have to roll back dropping a column?

> Out of curiosity, when do we have to roll back dropping a column?

The schema change can be cancelled, or another schema change started in the same transaction can fail (e.g., a unique index can't be built), in which case we roll everything back.

> Out of curiosity, when do we have to roll back dropping a column?
>
> The schema change can be cancelled, or another schema change started in the same transaction can fail (e.g., a unique index can't be built), in which case we roll everything back.

Got it. #46541 seems not good. Trying to roll back a partially dropped column seems fraught. I wonder if we can arrange to perform the drop column schema changes last. And then not allow them to be rolled back.

> I wonder if we can arrange to perform the drop column schema changes last. And then not allow them to be rolled back.

We've talked about something like this in relatively vague terms. It seems like, so long as we (a) leave the dropped column public until all other mutations which can fail have completed, (b) make sure we remove all constraints before dropping the column, and (c) make the drop-column mutation something which cannot be canceled, then we'll be out of this pickle.

Today we assume that the transaction which performs the drop sees the column in WRITE_ONLY for subsequent statements. My thinking is we should have that be true for that transaction but not actually commit the table descriptor in write only. This also gives us an opportunity to drop the constraints prior to making the dropped column non-public. That too is hazardous to roll back. My sense is as soon as we start dropping properties we need to make it impossible to roll back. In short:

1) Commit time

  • Create new columns and queue mutations
  • Add new columns in delete-only

2) Add columns, indexes, and constraints

  • Still can cancel or fail
  • Add new columns in write-only, backfill new indexes, attempt to add constraints

3) Drop constraints

  • Cannot cancel or fail
  • Drop constraints which have been dropped
  • Drop constraints for columns which are being dropped

4) Drop columns

  • Cannot cancel or fail
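The ordering above can be summarized as a small sketch in Go (types and names are illustrative, not CockroachDB code): each phase records whether the schema change can still be cancelled once that phase begins, and the invariant is that no cancellable phase follows a non-cancellable one.

```go
package main

import "fmt"

// phase is a hypothetical model of the proposed drop-column ordering:
// everything that can fail or be cancelled happens before anything
// irreversible (dropping constraints or column data).
type phase struct {
	name        string
	cancellable bool
}

var phases = []phase{
	{"commit: create new columns, queue mutations, add in delete-only", true},
	{"add columns/indexes/constraints: write-only, backfill, validate", true},
	{"drop constraints (explicit drops and those on dropping columns)", false},
	{"drop columns", false},
}

// pointOfNoReturn returns the index of the first phase that can no longer
// be cancelled; all cancellations/rollbacks must happen before it.
func pointOfNoReturn() int {
	for i, p := range phases {
		if !p.cancellable {
			return i
		}
	}
	return len(phases)
}

func main() {
	for i, p := range phases {
		fmt.Printf("%d) %s (cancellable: %v)\n", i+1, p.name, p.cancellable)
	}
	fmt.Printf("point of no return: phase %d\n", pointOfNoReturn()+1)
}
```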

I agree with all this. One thing to clarify: the new index that gets built in step (2) will exclude the columns being dropped, right? So if we have to roll back after validation at the end of step 2, we keep using the original index as though nothing happened. Otherwise, step (4) (which involves multiple table descriptor versions) consists of stepping through the states for the dropped column (but with no backfill needed), and the very last step is swapping to the new index.

This all sounds fantastic.

Will we collapse multiple related schema changes into a single step? If we add two columns to a table in a transaction, would we write a single new primary key index, or two? This relates to Lucy's point about dropping a column at the same time we're rewriting the primary index to add a column.

> Will we collapse multiple related schema changes into a single step? If we add two columns to a table in a transaction, would we write a single new primary key index, or two?

I think we'll always be able to execute multiple schema changes started in the same transaction with just one rewrite of the primary index. That rewrite can encompass all column add/drop operations as well as changes to the primary key.

@ajwerner @rohany @RichardJCai and I discussed this in person.

First a few notes, then the main problem:

  • This proposal can be thought of as a generalization of primary key changes: given an existing primary index, the index backfiller will be able to produce a new primary index with a specified list of columns and a specified primary key, in addition to being able to produce new secondary indexes (with specified lists of indexed columns and stored columns).

    • Idea for later: In theory, the source for the index backfiller doesn't always need to be the primary index. For instance, when rewriting a secondary index after changing a column's type, we could scan the original secondary index to build a copy with just that column changed. This would allow sorted ingestion for the index backfiller in some cases, which is ideal for producing non-overlapping SSTs, and it would also make primary and secondary indexes more symmetric.

  • We could implement this without changing how column mutations are created and queued. The schema changer would generate a set of new indexes to build based on the set of column mutations (and PK change mutations, etc.). This is a step toward making mutations more of a "logical" representation of pending schema changes as opposed to a place to store execution-related state.
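The "generalization of primary key changes" above can be made concrete with a sketch of the backfiller's input. All types and names here are hypothetical, invented for illustration rather than taken from CockroachDB's descriptor protos: the point is that the backfiller scans a source index and produces a new index with a specified key and stored-column list, never touching the old one.

```go
package main

import "fmt"

// columnID identifies a table column (illustrative stand-in for a
// descriptor column ID).
type columnID int

// backfillSpec describes one generalized index backfill: scan
// sourceIndexID and write newIndexID with the given shape.
type backfillSpec struct {
	sourceIndexID int        // index to scan; usually the current primary index
	newIndexID    int        // freshly allocated ID; the old index is never mutated
	keyColumns    []columnID // primary key of the new index
	storedColumns []columnID // non-key columns carried in the new index
}

// addColumnSpec builds the spec for ADD COLUMN: same key, one extra
// stored column. DROP COLUMN would instead omit the column from
// storedColumns; a PK change would alter keyColumns.
func addColumnSpec(srcID, newID int, key, stored []columnID, newCol columnID) backfillSpec {
	return backfillSpec{
		sourceIndexID: srcID,
		newIndexID:    newID,
		keyColumns:    append([]columnID(nil), key...),
		storedColumns: append(append([]columnID(nil), stored...), newCol),
	}
}

func main() {
	// Table with primary key on column 1, storing columns 2 and 3;
	// add column 4.
	spec := addColumnSpec(1, 2, []columnID{1}, []columnID{2, 3}, 4)
	fmt.Printf("new index %d: key %v, stores %v (source index %d untouched)\n",
		spec.newIndexID, spec.keyColumns, spec.storedColumns, spec.sourceIndexID)
}
```

Several column mutations queued in the same transaction could be folded into a single such spec, which is what makes mutations a "logical" representation here.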

The main problem with this proposal ("the generalized index backfiller") is with interleaved tables: If an index has interleaved children, it's not possible to write an updated index with a new index ID without also rewriting all its interleaved descendants. Depending on whether we decide this would be prohibitively expensive, we have two options:

  1. Never rewrite the interleaved descendants of a table. We only implement the generalized index backfiller for non-interleaved tables and interleaved tables without children, and fall back to the existing column backfiller for interleaved parents.
  2. Implement the generalized index backfiller for all tables, which rewrites descendant tables when necessary.

(NB: The interleaved ancestors of the table would not need to be rewritten. The index backfiller can already backfill secondary indexes interleaved into another table.)

Issues with option 1:

  • The benefits listed above no longer apply for interleaved parent tables. We also introduce complexity by having a separate implementation just for interleaved tables (which, anecdotally, are currently under-tested relative to their complexity).
  • We lose the benefit of getting rid of the existing column backfiller entirely and having a single, unified implementation. @dt points out that the column and index backfillers are so different that having both under the same "backfiller" interface makes it impossible to optimize either.
  • For our current async non-transactional schema changes, the reordering of backfill steps described above (which prohibits rolling back a column drop) will now require 2 column backfills instead of 1 when the same transaction both adds and drops columns. For the same reason, ALTER COLUMN TYPE will require 2 column backfills (as in the original RFC draft).
  • Having an index backfiller that doesn't touch the existing index will likely make transactional schema changes much easier to implement in the future, since we won't have to worry about other transactions seeing backfilled values for a non-public column in the original primary index, and we won't have to do a second backfill for rollbacks.

Issues with option 2:

  • Rewriting all the descendant tables is obviously expensive. If a child table is much larger than its parent table, it'll violate user expectations when the cost of doing a column schema change on the parent table is proportional to the size of the child table.
  • Writes to a child table will miss out on the 1PC optimization for the duration of a backfill initiated by an ancestor table, since writes will go to both the old and new indexes. I think this is no longer an issue if we end up implementing #36850.
  • A backfill that rewrites descendant tables requires coordinating a schema change across multiple tables, so that the table descriptor versions move in lockstep with each other. We'll have to update the schema change job to publish updates to multiple tables at once.

There are a few special performance concerns when backfilling an interleaved index compared to a non-interleaved index (applies to both options, but more so to (2)). We'll always be writing SSTs that overlap with existing keys in the span even if the new keys themselves are sorted, and there's some impact on traffic on the parent tables due to latch acquisition (not sure how significant this is). All this is already true for interleaved secondary indexes and interleaved ALTER PRIMARY KEY.

I think the next step is to figure out whether we can tolerate rewriting all the descendant tables for option (2).
