The current version of scatter is slower than the legacy version when scattering with a random `scatter_map` into tables with large column sizes (~10^6 rows and above). For example, from the scatter benchmark:
benchmark | column size | # of columns | legacy [ns] | current [ns] | regression
:-: | :-: | :-: | :-: | :-: | :-:
double_coalesce_o | 1048576 | 4 | 820361 | 1067749 | 1.28
double_coalesce_o | 2097152 | 4 | 2284413 | 2911159 | 1.27
double_coalesce_o | 4194304 | 4 | 5717116 | 6867768 | 1.20
double_coalesce_o | 8388608 | 4 | 12684844 | 14977435 | 1.18
double_coalesce_o | 16777216 | 4 | 26684885 | 31330737 | 1.17
double_coalesce_o | 33554432 | 4 | 55134710 | 64127488 | 1.16
**Describe the solution you'd like**
A possible reason: the legacy scatter first inverts the `scatter_map` into a `gather_map` and performs a gather. For a random `scatter_map`, this turns random store memory accesses into random load memory accesses, and random loads are usually cheaper than random stores.
Needs to:
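The inversion described above can be sketched on the host side with NumPy (a minimal illustration of the idea only; cudf's actual device kernels and null-mask handling differ, and the function name here is hypothetical):

```python
import numpy as np

def scatter_via_gather(source, scatter_map, target):
    """Scatter `source` rows into `target` at positions given by
    `scatter_map`, implemented by inverting the map and gathering."""
    n = len(target)
    # Invert: gather_map[scatter_map[i]] = i, so output row j is taken
    # from source row gather_map[j], or kept from target if untouched.
    gather_map = np.full(n, -1, dtype=np.int64)
    gather_map[scatter_map] = np.arange(len(scatter_map))
    out = target.copy()
    hit = gather_map >= 0
    # The gather reads source at random positions (random loads) while
    # writing the output in order, instead of random stores.
    out[hit] = source[gather_map[hit]]
    return out
```

For example, scattering `[10., 20., 30.]` with map `[4, 0, 2]` into a zeroed 5-row column yields `[20., 0., 30., 0., 10.]`.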
Finally I was able to profile the benchmarks with `nsys`. With a column size of 8388608 and 4 columns per table, the breakdown of the legacy and current random scatter is:
[ms] | memcopy | data | bitmask | invert map | total
:-: | :-: | :-: | :-: | :-: | :-:
legacy | 1.09 | 8.52 | 0.74 | 2.36 | 12.75
current | 1.09 | 11.56 | 2.38 | 0.00 | 15.10
deficit | 0.00 | +3.04 | +1.64 | -2.36 | +2.35
So overall, inverting the `scatter_map` into a `gather_map` saves time for large column sizes.