Is your feature request related to a problem? Please describe.
CUDA errors can show up asynchronously from things like kernel calls or async CUDA APIs. This is problematic because an error may occur, but it doesn't show up until a future CUDA runtime function call.
For example:
error_kernel<<<...>>>(...); // This kernel will cause an error
auto error = cudaPeakLastError(); // this may or may not return cudaSuccess
auto error = cudaDeviceSynchronize(); // If uncaught above, this is guaranteed to return an error
Without synchronizing, there is no way to guarantee that an error hasn't occurred.
Obviously we do not want to add a bunch of extra synchronization into our code to check for errors as this would cause unacceptable slow-down in release builds. However, we can/should do a better job of regularly checking for errors from any previous, asynchronous launches.
We currently have a utility, CHECK_STREAM, which helps with this, but only in DEBUG builds.
CHECK_STREAM(stream);
This will synchronize stream and check for a CUDA error.
It would be nice in release builds to have CHECK_STREAM check for latent CUDA errors and throw an exception accordingly. In this way, it gives the user an indication that something bad is happening and use a DEBUG build to pinpoint exactly where the error is coming from.
Describe the solution you'd like
Update CHECK_STREAM (maybe rename to better convey what it does? Like CHECK_CUDA?) for release builds to invoke cudaGetLastError() and throw an exception if a latent CUDA error is detected.
We'd also need to update and remind devs on guidance on when/where CHECK_STREAM should be used.
Additional context
Related to conversation in:
https://github.com/rapidsai/rmm/pull/193
https://github.com/rapidsai/cudf/issues/3510
https://github.com/rapidsai/cudf/issues/3158
TO be clear, in release mode CHECK_STREAM should NOT synchronize but only call cudaGetLastError(). In debug mode it should synchronize.
I also forgot that we already have CUDA_CHECK_LAST() which effectively does the same thing as the proposed changes to CHECK_STREAM, but then we'd have to sprinkle both macros around. I'm of the mind that the two should be merged.
So we want to
One issue in merging CHECK_STREAM and CUDA_CHECK_LAST_ERROR is that CHECK_STREAM requires stream (used only on debug build) while stream may not be always available on CUDA_CHECK_LAST_ERROR (unless we assume default stream).
I can think of two options.
Merge CHEK_STREAM & CUDA_CHECK_LAST_ERROR to CHECK_CUDA_ERROR and call cudaDeviceSynchronize() first on debug build and call cudaGetLastError() on both debug & release build. CHECK_CUDA_ERROR no longer takes stream (which is used only on debug build).
Merge CHECK_STREAM & CUDA_CHECK_LAST_ERROR to CHECK_CUDA_ERROR taking stream with default value = 0. This will call cudaStreamSynchronize(stream) on debug build and call cudaGetLastError() on both debug & release build.
I lean towards Option 1 as passing stream on both release & debug build (stream is used only in debug build) can be a bit misleading on release build, and performance impact will be limited only on debug build, but let me know if anyone has concerns or better ideas.
I like option 2, since all of our internal functions that copy memory and run kernels should be taking a stream parameter and following stream semantics.
I like option 2, since all of our internal functions that copy memory and run kernels should be taking a stream parameter and following stream semantics.
OK, I will implement the option 2.
Fixed by #3587