The current version of scatter is slower than the legacy version when scattering with a random `scatter_map` into tables with large column sizes (~10^6 rows and above). For example, from the scatter benchmark:
benchmark | column size | # of columns | legacy [ns] | current [ns] | regression
:-: | :-: | :-: | :-: | :-: | :-:
double_coalesce_o | 1048576 | 4 | 820361 | 1067749 | 1.28
double_coalesce_o | 2097152 | 4 | 2284413 | 2911159 | 1.27
double_coalesce_o | 4194304 | 4 | 5717116 | 6867768 | 1.20
double_coalesce_o | 8388608 | 4 | 12684844 | 14977435 | 1.18
double_coalesce_o | 16777216 | 4 | 26684885 | 31330737 | 1.17
double_coalesce_o | 33554432 | 4 | 55134710 | 64127488 | 1.16
**Describe the solution you'd like**
A possible reason: the legacy scatter first inverts the `scatter_map` into a `gather_map` and performs a gather. For a random `scatter_map`, this turns random store memory accesses into random load memory accesses, and random loads are usually cheaper than random stores.
Needs to:
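The inversion described above can be sketched on the host side with NumPy (a minimal illustration of the idea only; cudf's actual device kernels and null-mask handling differ, and the function name here is hypothetical):

```python
import numpy as np

def scatter_via_gather(source, scatter_map, target):
    """Scatter `source` rows into `target` at positions given by
    `scatter_map`, implemented by inverting the map and gathering."""
    n = len(target)
    # Invert: gather_map[scatter_map[i]] = i, so output row j is taken
    # from source row gather_map[j], or kept from target if untouched.
    gather_map = np.full(n, -1, dtype=np.int64)
    gather_map[scatter_map] = np.arange(len(scatter_map))
    out = target.copy()
    hit = gather_map >= 0
    # The gather reads source at random positions (random loads) while
    # writing the output in order, instead of random stores.
    out[hit] = source[gather_map[hit]]
    return out
```

For example, scattering `[10., 20., 30.]` with map `[4, 0, 2]` into a zeroed 5-row column yields `[20., 0., 30., 0., 10.]`.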
Finally I was able to profile the benchmarks with `nsys`. With a column size of 8388608 and 4 columns per table, the breakdown of the legacy and current random scatter is:
[ms] | memcopy | data | bitmask | invert map | total
:-: | :-: | :-: | :-: | :-: | :-:
legacy | 1.09 | 8.52 | 0.74 | 2.36 | 12.75
current | 1.09 | 11.56 | 2.38 | 0.00 | 15.10
deficit | 0.00 | +3.04 | +1.64 | -2.36 | +2.35
So overall, inverting the `scatter_map` into a `gather_map` saves time for large column sizes.