There was a bug in the parallel_scan implementation: the required inter-block memory fence (__threadfence()) wasn't called, so writes made by one block were not guaranteed to be visible to other blocks before they were read. This manifested as an incorrect scan result.
Fixed in PRs #2666 and #2668.
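For readers unfamiliar with why the missing fence matters, here is a minimal CUDA sketch of the general pattern. It is not the Kokkos parallel_scan code and not the change made in the PRs above; it is a simplified block-sum reduction modeled on the well-known "last block finishes the job" idiom, and every name in it (block_sum_kernel, blocks_done, run_block_sum) is hypothetical. Each block publishes a partial result to global memory, calls __threadfence(), and only then signals completion through an atomic counter. Without the fence, the block that reads the partial results could observe the counter update before the data it is supposed to guard.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical illustration only; not taken from Kokkos.
// Counts how many blocks have published their partial sum.
__device__ unsigned int blocks_done = 0;

// Assumes blockDim.x is a power of two and at most 256.
__global__ void block_sum_kernel(const int* in, volatile int* partial,
                                 int* out, int n) {
    __shared__ int s[256];
    __shared__ bool is_last;

    const int tid = threadIdx.x;
    const int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Tree reduction within the block, in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0) {
        partial[blockIdx.x] = s[0];
        // Inter-block fence: order the partial-sum write before the
        // counter update below, as seen by every other block.
        __threadfence();
        const unsigned int ticket = atomicInc(&blocks_done, gridDim.x);
        is_last = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    // The last block to finish combines all partial sums.
    if (is_last && tid == 0) {
        int total = 0;
        for (unsigned int b = 0; b < gridDim.x; ++b) total += partial[b];
        *out = total;
        blocks_done = 0;  // reset so the kernel can be launched again
    }
}

// Host helper: sums n ones on the device and returns the result.
int run_block_sum(int n) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    int *in, *partial, *out;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&partial, blocks * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = 1;
    *out = 0;
    block_sum_kernel<<<blocks, threads>>>(in, partial, out, n);
    cudaDeviceSynchronize();
    const int result = *out;
    cudaFree(in);
    cudaFree(partial);
    cudaFree(out);
    return result;
}

int main() {
    const int n = 1 << 16;
    printf("sum = %d (expected %d)\n", run_block_sum(n), n);
    return 0;
}
```

Removing the __threadfence() call from this sketch does not fail reliably, which is typical of this class of bug: the result is only occasionally wrong, so a dependable reproducer is hard to come by.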
@stanmoore1 found the reproducer which made it possible to find this thing ...
Closing this issue, as the fix is now in master.