Describe the bug
While trying to add a new enum element to a column
ALTER TABLE table MODIFY COLUMN action Enum8(
'unknown' = 0,
'known' = 1
);
under production workload, queries started to fail with the following exception
DB::Exception: Type mismatch for column action. Column has type Enum8('unknown' = 0, 'known' = 1), got type Enum8('unknown' = 0).
Inspecting the code led me to the following lines: https://github.com/yandex/ClickHouse/blob/e24af1d765770507ee0297625649568e625d1d61/dbms/src/Storages/ITableDeclaration.cpp#L190-L224
Seems that structure of MergeTreeData was modified, but MergeTreeDataPart structure not yet.
Inspecting alter and select implementation I found that alter gets a data lock first (a1) and after modifying MergeTreeData columns (data.checkAlter(params); call) it locks data structure (a2). select just locks the structure. (b1)
If we have the following execution sequence: a1, b1, a2 which seems possible, then we have a concurrency bug.
Additional context
Version: 18.14.17
Let me know if this makes sense.
So far couldn't reproduce, maybe the cause lies somewhere else.
Is the table replicated or not?
For non-replicated tables the locking proceeds as follows:
data_lock is locked exclusively. BTW the lock names are super confusing, but this data_lock is locked in the shared mode by inserts and merges, so if you lock it exclusively, you can be sure that no new parts appear in the table. After this point there are no concurrent ALTERs and INSERTs.data.checkAlter(params) statement, note that it doesn't modify anything).structure_lock is locked exclusively. This additionally locks out SELECTs.So in this case INSERTs are locked for quite a long time (while new part files are prepared), but SELECTs can proceed (using the old structure).
For Replicated tables the tradeoffs are different. Look here for source code of the background task that actually performs the alter.
structure_lock is locked exclusively. This locks out SELECTs, INSERTs and ALTERs. Table structure is modified, and the lock is then unlocked. This means that INSERTs can proceed immediately (using the new table structure), but SELECTs will throw exceptions until all parts are altered.structure_lock is locked in shared mode. This locks out concurrent ALTERs (and only them). The parts are modified and the lock is released. After this the ALTER is finished and SELECTs can proceed as usual.So if you observe this effect for Replicated tables, this is, well, by design.
I think that described behaviour is acceptable for alters which do really make some significant change in the table structure.
But with Enums... it's quite stupid, it should be really cheap operation, as no data should be changed, only type description.
@ztlpn it does make sense now. This happened for replicated table, didn't pay attention that there are different alter implementations.
@filimonov I fully agree with you, let's create a feature request for it if one does not exist yet.
later edit: it is already metadata only change, however table metadata is modified before part metadata if I understand correctly and this causes the error mentioned above. See isMetadataOnlyConversion