Cudf: [FEA] Abstract away block reduce and block scan from cuIO kernels

Created on 15 Sep 2020  ·  15 Comments  ·  Source: rapidsai/cudf

The ORC and Parquet writer kernels each use a block reduction or a block scan, but write the entire logic out inline inside the kernel.

Here's a block reduction:

https://github.com/rapidsai/cudf/blob/57ef76927373d7260b6a0eda781e59a4c563d36e/cpp/src/io/orc/dict_enc.cu#L431-L434

Here's a block scan:

https://github.com/rapidsai/cudf/blob/57ef76927373d7260b6a0eda781e59a4c563d36e/cpp/src/io/orc/dict_enc.cu#L266-L284

This code performs a block scan on dupes_before, but the code for this operation is interspersed with other logic.
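For illustration, a hand-rolled scan like this could be replaced by a single cub::BlockScan call. The sketch below is hypothetical: the kernel name, block_size, and buffer names are made up for the example and are not the actual dict_enc.cu identifiers.

```cuda
#include <cub/cub.cuh>

// Sketch only: a block-wide exclusive prefix sum via cub::BlockScan.
template <int block_size>
__global__ void scan_dupes_kernel(uint32_t const* dupes_in,
                                  uint32_t* dupes_before_out,
                                  int num_items)
{
  using BlockScan = cub::BlockScan<uint32_t, block_size>;
  __shared__ typename BlockScan::TempStorage temp_storage;

  int t = threadIdx.x;
  // Out-of-range threads contribute the identity (0 for a sum).
  uint32_t thread_dupes = (t < num_items) ? dupes_in[t] : 0;
  uint32_t dupes_before;  // exclusive prefix sum for this thread

  // One call replaces the interleaved shared-memory scan loop.
  BlockScan(temp_storage).ExclusiveSum(thread_dupes, dupes_before);

  if (t < num_items) dupes_before_out[t] = dupes_before;
}
```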

Proposed fix

  • Either implement a block scan and a block reduction and add them to block_utils.cuh. It would be helpful if these could be called by just the active threads rather than all threads (perhaps with creative use of coalesced thread masks), but this is not critical, since every call site already ensures that __syncthreads() can be called.
  • Use an off the shelf cub implementation.
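As a rough illustration of the second option, the linked reduction could collapse into one cub::BlockReduce call. This is a sketch under assumed names (the kernel, block_size, and buffers are illustrative, not the real cuIO code):

```cuda
#include <cub/cub.cuh>

// Sketch only: what an off-the-shelf replacement could look like.
template <int block_size>
__global__ void reduce_kernel(uint32_t const* data,
                              uint32_t* result,
                              int num_items)
{
  using BlockReduce = cub::BlockReduce<uint32_t, block_size>;
  __shared__ typename BlockReduce::TempStorage temp_storage;

  int t = threadIdx.x;
  // Out-of-range threads contribute the identity (0 for a sum).
  uint32_t thread_val = (t < num_items) ? data[t] : 0;

  // Replaces the explicit shared-memory reduction loop in the kernel.
  uint32_t block_sum = BlockReduce(temp_storage).Sum(thread_val);

  // The aggregate is only guaranteed valid on thread 0.
  if (t == 0) *result = block_sum;
}
```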
Labels: cuIO, feature request, tech debt

All 15 comments

Use an off the shelf cub implementation

This is almost certainly the way to go. It's likely that the CUB implementations will be faster as well.

Use an off the shelf cub implementation

This is almost certainly the way to go. It's likely that the CUB implementations will be faster as well.

I'd like this too, but I'm tempted by the idea of calling a block reduce with only the active threads.

I'd like this too, but I'm tempted by the idea of calling a block reduce with only the active threads.

Don't all threads contribute to the final reduction result?

They do, but in the last iteration some threads might not. To handle this, instead of

for (size_t i = start; i < end; i += stride) {
  // data[i]
}

It currently uses

for (size_t i = 0; i < size; i += stride) {
  if (i + t < end) data[i + t];
}

e.g. here https://github.com/rapidsai/cudf/blob/57ef76927373d7260b6a0eda781e59a4c563d36e/cpp/src/io/orc/dict_enc.cu#L399-L400

They do, but in the last iteration some threads might not.

Is there useful work for the remaining threads to do after the last iteration where it is actually detrimental to synchronize them?

It looks to me in the dict_enc example that the remaining threads still participate in the reduction, but only contribute 0 to the final reduction result.

Is there useful work for the remaining threads to do after the last iteration where it is actually detrimental to synchronize them?

Not for reducing work; this is just for readability of the loop and the code inside it. Otherwise most operations would need an if (i + t < end) guard before them.

But there are other reasons to allow a __syncthreads() inside a loop, such as calling another block function like RleEncode, which itself uses __syncthreads(). So it's not too important.

Replacing the current code with cub::BlockReduce would be a great improvement anyway.

@devavret could using the valid_items parameter with cub::BlockReduce (link) help in handling the last iteration?

@devavret could using the valid_items parameter with cub::BlockReduce (link) help in handling the last iteration?

If you mean that it might obviate the need to set out-of-bounds elements to 0, then it appears so from the docs. If you mean it might allow us to write our loops in the for (size_t i = start; i < end; i += stride) style, then perhaps not: cub::BlockReduce is not __syncthreads()-free.
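To make the valid_items point concrete, here is a hypothetical sketch using the num_valid overload of cub::BlockReduce::Sum. Only the first num_valid threads' inputs participate, so out-of-range elements no longer need to be zeroed by hand (the kernel and variable names are illustrative):

```cuda
#include <cub/cub.cuh>

// Sketch only: strided block sum using the num_valid overload.
template <int block_size>
__global__ void strided_sum_kernel(uint32_t const* data,
                                   uint32_t* result,
                                   int size)
{
  using BlockReduce = cub::BlockReduce<uint32_t, block_size>;
  __shared__ typename BlockReduce::TempStorage temp_storage;

  int t = threadIdx.x;
  uint32_t total = 0;  // only meaningful on thread 0
  for (int i = 0; i < size; i += block_size) {
    int valid = min(size - i, block_size);       // items in this pass
    uint32_t v = (t < valid) ? data[i + t] : 0;  // read only in range
    // Every thread must still call Sum (it is not __syncthreads()-free),
    // but only the first `valid` inputs contribute to the aggregate.
    total += BlockReduce(temp_storage).Sum(v, valid);
    __syncthreads();  // before reusing temp_storage next iteration
  }
  if (t == 0) *result = total;
}
```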

Keep in mind that a lot of what CUB does is abstract away differences between pre-sm_50 and Maxwell+ architectures (the existence of shfl).
Perhaps I missed something, but my early experiments with CUB reductions invariably showed extremely high shared memory usage (around 4 KB of shared storage for a block reduction) and much higher register usage than the equivalent explicit butterfly-shuffle reduction (this is largely why I did not end up relying on CUB in cuIO kernels).

If that's the case we can file a CUB issue and get it fixed (or fix it ourselves and contribute it). Sharpen our tools.

I doubt there is anything to fix.

CUB has 3 different algorithms you can select for the block reduction: https://nvlabs.github.io/cub/namespacecub.html#add0251c713859b8974806079e498d10a

The pros/cons of each are described in the documentation. Shared memory/register usage only matters if it negatively impacts occupancy.
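For reference, the algorithm is selected via the third template parameter of cub::BlockReduce; a small hypothetical sketch (block_size and the alias name are illustrative):

```cuda
#include <cub/cub.cuh>

// The third template parameter selects the algorithm; the default is
// BLOCK_REDUCE_WARP_REDUCTIONS (shfl-based where the hardware has it).
// BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY trades restrictions (commutative
// operators only) for lower resource usage in some configurations.
template <int block_size>
using RakingReduce =
  cub::BlockReduce<uint32_t, block_size,
                   cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY>;

// The call pattern is unchanged, e.g. inside a kernel:
//   __shared__ typename RakingReduce<256>::TempStorage storage;
//   uint32_t sum = RakingReduce<256>(storage).Sum(val);
```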

Just trying to help. All I'm saying is make sure to check such assumptions against reality.

Just trying to help. All I'm saying is make sure to check such assumptions against reality.

How should one check these? What's your tool of choice? Nsight Compute / cuobjdump? I'm asking because I haven't checked the shared memory or register usage of a kernel before.

How should one check these? What's your tool of choice? Nsight Compute / cuobjdump? I'm asking because I haven't checked the shared memory or register usage of a kernel before.

cuobjdump --dump-resource-usage is the easiest way to see the raw resource usage of a kernel. Nsight Compute can then be used to see if any change in resource usage affected occupancy.

Personally, I always use the ptxas -v option at build time during development (I can't imagine doing any serious CUDA work without it). I also consider any non-zero local memory (lmem) usage a bug, with very rare exceptions.
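For anyone new to these tools, the commands look roughly like this (file names are placeholders; both tools ship with the CUDA toolkit):

```shell
# Print per-kernel register/shared/local memory usage at build time
# (this is the "ptxas -v" output mentioned above):
nvcc -Xptxas -v -c dict_enc.cu -o dict_enc.o

# Inspect resource usage of an already-built object or binary:
cuobjdump --dump-resource-usage dict_enc.o
```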

