ORC and Parquet writer kernels use a block reduction or a block scan, but write out the entire logic inline inside the kernel.
https://github.com/rapidsai/cudf/blob/57ef76927373d7260b6a0eda781e59a4c563d36e/cpp/src/io/orc/dict_enc.cu#L431-L434
https://github.com/rapidsai/cudf/blob/57ef76927373d7260b6a0eda781e59a4c563d36e/cpp/src/io/orc/dict_enc.cu#L266-L284
This code does a block scan on `dupes_before`, but the code for this operation is interspersed with other logic.
`__syncthreads()` can be called.

**Use an off the shelf cub implementation**
This is almost certainly the way to go. It's likely that the CUB implementations will be faster as well.
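For illustration, a minimal sketch of what the `dupes_before` scan could look like with `cub::BlockScan` (the block size, value type, and surrounding structure here are assumptions, not the actual dict_enc.cu code):

```cpp
#include <cstdint>
#include <cub/cub.cuh>

// Hypothetical sketch, not the actual cudf kernel: the hand-rolled prefix sum
// over each thread's duplicate count is replaced by a single CUB call.
template <int block_size>
__global__ void build_dict_sketch()
{
  using BlockScan = cub::BlockScan<uint32_t, block_size>;
  __shared__ typename BlockScan::TempStorage scan_storage;

  uint32_t dupes_before = 0;  // per-thread duplicate count (placeholder)
  uint32_t dupes_prefix;      // exclusive prefix sum across the block
  uint32_t total_dupes;       // block-wide total

  BlockScan(scan_storage).ExclusiveSum(dupes_before, dupes_prefix, total_dupes);
  // Like the other CUB block collectives, this synchronizes internally,
  // so every thread in the block must reach this call.
}
```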
> **Use an off the shelf cub implementation**
>
> This is almost certainly the way to go. It's likely that the CUB implementations will be faster as well.
I'd like this too, but I get tempted by the idea of calling a block reduce with only the active threads.
> I'd like this too, but I get tempted by the idea of calling a block reduce with only the active threads.
Don't all threads contribute to the final reduction result?
They do, but in the last iteration some threads might not. For that reason, instead of
```cpp
for (size_t i = start; i < end; i += stride) {
  // data[i]
}
```
it currently uses
```cpp
for (size_t i = 0; i < size; i += stride) {
  // t is the thread index within the block (threadIdx.x)
  if (i + t < end) data[i + t];
}
```
e.g. here https://github.com/rapidsai/cudf/blob/57ef76927373d7260b6a0eda781e59a4c563d36e/cpp/src/io/orc/dict_enc.cu#L399-L400
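Concretely, the pattern amounts to something like the sketch below (the identifiers and the use of `cub::BlockReduce` are my own illustration, not the cudf code): every thread runs the same number of iterations, and threads that fall past the end of the data contribute 0, so a block-wide reduction with its internal synchronization can sit inside the loop body.

```cpp
#include <cstdint>
#include <cub/cub.cuh>

// Sketch only; names are placeholders. All threads run the same iteration
// count, and out-of-range threads contribute 0, so the block-wide reduction
// (which requires every thread to participate) is safe inside the loop.
template <int block_size>
__global__ void strided_sum_sketch(uint32_t const* data, size_t size, uint32_t* out)
{
  using BlockReduce = cub::BlockReduce<uint32_t, block_size>;
  __shared__ typename BlockReduce::TempStorage temp;

  uint32_t const t = threadIdx.x;
  for (size_t i = 0; i < size; i += block_size) {
    uint32_t const val = (i + t < size) ? data[i + t] : 0;  // out of range -> 0
    uint32_t const sum = BlockReduce(temp).Sum(val);        // result valid in thread 0
    if (t == 0) atomicAdd(out, sum);
    __syncthreads();  // needed before reusing temp in the next iteration
  }
}
```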
> They do, but in the last iteration some threads might not.
Is there useful work for the remaining threads to do after the last iteration, where it would actually be detrimental to synchronize them?
It looks to me like in the dict_enc example the remaining threads are still participating in the reduction, but only contributing 0 to the final reduction result.
> Is there useful work for the remaining threads to do after the last iteration, where it would actually be detrimental to synchronize them?
Not for the reduction; this is just for readability of the loop and the code inside it. Otherwise most operations would need an `if (i + t < end)` guard before them.
But there are other reasons to allow a `__syncthreads()` inside a loop, such as calling another block function like RleEncode, which itself uses `__syncthreads()`. So it's not too important.
Replacing the current code with cub::BlockReduce would be a great improvement anyway.
@devavret could using the valid_items parameter with cub::BlockReduce (link) help in handling the last iteration?
> @devavret could using the valid_items parameter with cub::BlockReduce (link) help in handling the last iteration?
If you mean that it might obviate the need to set out-of-bounds elements to 0, then it appears so from the docs. If you mean it might allow us to write our loops in the `for (size_t i = start; i < end; i += stride)` style, then perhaps not; cub::BlockReduce is not `__syncthreads()`-free.
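For reference, this is roughly what the `num_valid` overload could look like in the strided-loop case (a sketch with placeholder names): the bounds check on the load itself is still needed, but the values passed by the out-of-range threads no longer have to be zero for the sum to be correct, and every thread must still call `Sum()` because it synchronizes internally.

```cpp
#include <cstdint>
#include <cub/cub.cuh>

// Sketch with placeholder names: in a partial last iteration, pass the number
// of in-range threads to Sum(); the inputs of the remaining threads are ignored.
template <int block_size>
__global__ void tail_sum_sketch(uint32_t const* data, size_t size, uint32_t* out)
{
  using BlockReduce = cub::BlockReduce<uint32_t, block_size>;
  __shared__ typename BlockReduce::TempStorage temp;

  uint32_t const t = threadIdx.x;
  for (size_t i = 0; i < size; i += block_size) {
    size_t const remaining = size - i;
    int const num_valid    = remaining < block_size ? static_cast<int>(remaining) : block_size;
    uint32_t const val     = (i + t < size) ? data[i + t] : 0;       // guard the load only
    uint32_t const sum     = BlockReduce(temp).Sum(val, num_valid);  // valid in thread 0
    if (t == 0) atomicAdd(out, sum);
    __syncthreads();  // before reusing temp in the next iteration
  }
}
```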
Keep in mind that a lot of what cub does is abstract away differences between pre-sm_50 and Maxwell+ (the existence of shfl).
Perhaps I missed something, but my early experiments with cub reductions invariably showed extremely high shared mem usage (like 4K smem storage for a block reduction) and much higher register usage than the equivalent explicit butterfly shuffle reduction (this is largely why I did not end up relying on cub in cuio kernels).
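For context, the kind of explicit butterfly shuffle reduction being referred to looks roughly like this generic warp-level sketch (not the actual cuio code); it uses no shared memory and leaves the warp-wide sum in every lane:

```cpp
#include <cstdint>

// Warp-level butterfly reduction: each step exchanges values between lanes
// whose ids differ by `offset`, so after log2(32) steps every lane holds the
// sum of all 32 lanes. No shared memory is used.
__device__ inline uint32_t warp_butterfly_sum(uint32_t val)
{
  for (int offset = 16; offset > 0; offset >>= 1) {
    val += __shfl_xor_sync(0xffffffff, val, offset);
  }
  return val;
}
```

A block-wide version would then combine one partial result per warp, typically through a small amount of shared memory.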
If that's the case we can file a CUB issue and get it fixed (or fix it ourselves and contribute it). Sharpen our tools.
I doubt there is anything to fix.
CUB has 3 different algorithms you can select for the block reduction: https://nvlabs.github.io/cub/namespacecub.html#add0251c713859b8974806079e498d10a
The pros/cons of each are described in the documentation. Shared memory and register usage only matter if they negatively impact occupancy.
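For reference, the algorithm is selected via the third template parameter of `cub::BlockReduce`; e.g. (the value type and block size of 256 here are just example values):

```cpp
#include <cstdint>
#include <cub/cub.cuh>

// Explicitly selecting the warp-reductions variant; the other options are
// cub::BLOCK_REDUCE_RAKING and cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY.
using BlockReduce256 =
  cub::BlockReduce<uint32_t, 256, cub::BLOCK_REDUCE_WARP_REDUCTIONS>;
```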
Just trying to help. All I'm saying is make sure to check such assumptions against reality.
> Just trying to help. All I'm saying is make sure to check such assumptions against reality.
How should one check these? What's your tool of choice: Nsight Compute, cuobjdump? I'm asking because I haven't checked shared mem usage or register usage of a kernel before.
> How should one check these? What's your tool of choice: Nsight Compute, cuobjdump? I'm asking because I haven't checked shared mem usage or register usage of a kernel before.
`cuobjdump --dump-resource-usage` is the easiest way to see the raw resource usage of a kernel. Nsight Compute can then be used to see if any change in resource usage affected occupancy.
Personally, I always use the `ptxas -v` option at build time during development (can't imagine doing any serious CUDA work without it). I also consider any non-zero lmem usage a bug, with very rare exceptions.