Cudf: [FEA] add API to libcudf so we can query at runtime whether it was compiled for PTDS

Created on 28 Aug 2020 · 14 comments · Source: rapidsai/cudf

When running the RAPIDS spark plugin, we'd like to be able to log if PTDS is on or off in our environment.

In order to do this, we'll need to query libcudf at runtime to know whether it was compiled for PTDS. We'd like to propose a static variable whose value reflects whether PTDS is on, so we can inspect it at runtime. Something like:

#ifdef CUDA_API_PER_THREAD_DEFAULT_STREAM
bool cudf::ptds_enabled = true;
#else
bool cudf::ptds_enabled = false;
#endif

Not sure where it should go or if the above makes sense, but something to that effect, so we can include a header on the JNI side and query it for our log.
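A minimal sketch of what that could look like, with a function wrapper that is easier to bind from JNI than a bare global (the name is_ptds_enabled is hypothetical, not existing libcudf API):

```cpp
// Sketch only: CUDA_API_PER_THREAD_DEFAULT_STREAM is the macro that gets
// defined at compile time (e.g. by nvcc's --default-stream per-thread) to
// enable per-thread default streams in the CUDA runtime.
namespace cudf {
#ifdef CUDA_API_PER_THREAD_DEFAULT_STREAM
inline constexpr bool ptds_enabled = true;
#else
inline constexpr bool ptds_enabled = false;
#endif

// A query function (hypothetical name) is simpler to expose through a JNI
// wrapper than a bare variable.
inline bool is_ptds_enabled() { return ptds_enabled; }
}  // namespace cudf
```

The JNI layer could then call cudf::is_ptds_enabled() once at startup and log the result.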

Spark feature request libcudf

All 14 comments

This actually raises an interesting question... Maybe we should provide an API cudf::setDefaultStream(cudaStream_t stream) like other libraries (cuDNN, cuBLAS) to override the PTDS behavior regardless of how it was compiled.

Then you could call cudf::setDefaultStream(cudaStreamLegacy) and it would always use the legacy default stream (even if compiled with PTDS enabled). You could call cudf::setDefaultStream(cudaStreamPerThread) and it would always use the per-thread default stream (even if compiled with PTDS disabled).

This would require changing every place that currently passes 0 or nullptr for stream to instead call cudf::getDefaultStream().

But it would make the library much more friendly to applications that want runtime control over PTDS.

CC @jrhemstad @kkraus14 for opinions.
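A rough sketch of what such a getter/setter pair could look like (names hypothetical; cudaStream_t is mocked here as an opaque pointer so the idea compiles without the CUDA toolkit — the mock handle values 0x1 and 0x2 match what the CUDA headers actually use for cudaStreamLegacy and cudaStreamPerThread):

```cpp
#include <atomic>

// Mocked stand-ins for the CUDA runtime type and its built-in stream handles.
using cudaStream_t = void*;
inline cudaStream_t const cudaStreamLegacy    = reinterpret_cast<cudaStream_t>(0x1);
inline cudaStream_t const cudaStreamPerThread = reinterpret_cast<cudaStream_t>(0x2);

namespace cudf {
namespace detail {
// Process-wide default, initialized to the legacy stream to match the
// current behavior of passing stream 0.
inline std::atomic<cudaStream_t>& default_stream() {
  static std::atomic<cudaStream_t> stream{cudaStreamLegacy};
  return stream;
}
}  // namespace detail

inline void set_default_stream(cudaStream_t s) { detail::default_stream().store(s); }
inline cudaStream_t get_default_stream() { return detail::default_stream().load(); }
}  // namespace cudf
```

Internal code that currently passes 0 or nullptr would then call cudf::get_default_stream() instead.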

Any place that libcudf calls Thrust without an explicit stream is a bug -- if you see 'em, report 'em. That said, I don't think it's possible to pass a stream to a thrust::device_vector constructor, for example, so if PTDS is enabled it presumably uses the per-thread default stream for the uninitialized_fill whether you like it or not. Use rmm::device_uvector wherever possible for that reason.

Would cudf::setDefaultStream set the stream for only the local thread or for all threads in a process?

Would cudf::setDefaultStream set the stream for only the local thread or for all threads in a process?

As far as I understand, it would be a default: if you still provide your own hard-coded stream to a function, the function must use the stream the user provided.
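That precedence can be expressed with a defaulted parameter — a hypothetical helper, not libcudf code, with cudaStream_t mocked so it compiles without CUDA:

```cpp
using cudaStream_t = void*;  // mocked opaque handle
inline cudaStream_t const cudaStreamLegacy = reinterpret_cast<cudaStream_t>(0x1);

// Stand-in for the proposed cudf::get_default_stream().
inline cudaStream_t get_default_stream() { return cudaStreamLegacy; }

// An explicitly passed stream always wins; the configurable default only
// applies when the caller omits the argument.
inline cudaStream_t resolve_stream(cudaStream_t stream = get_default_stream()) {
  return stream;
}
```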

Use rmm::device_uvector wherever possible for that reason

rmm::device_uvector isn't a subclass of thrust::device_vector, so it can't really be used in our case, since we depend on Thrust pretty extensively. @harrism am I missing something? Do you think we need to make device_uvector a subclass of the Thrust one? Or maybe work on a Thrust change to not use stream 0 for thrust::device_vector.

You probably saw this one already: https://github.com/thrust/thrust/issues/1017

Never mind. We only use thrust::device_vector in the tests.

As discussed offline, rmm::device_uvector can be used with Thrust APIs (as can raw device pointers).

Would cudf::setDefaultStream set the stream for only the local thread or for all threads in a process?

it does not set _the_ stream, it sets the _default_ stream. libcudf public APIs do not currently take a stream parameter. So this would just change what stream internal APIs use when they currently use stream 0. This setting would be for all threads.

it does not set _the_ stream, it sets the _default_ stream. libcudf public APIs do not currently take a stream parameter. So this would just change what stream internal APIs use when they currently use stream 0. This setting would be for all threads.

Is the idea for a user to then call something like cudf::setDefaultStream(cudaStreamPerThread) to get PTDS semantics?

I tentatively like the idea of having a set_default_stream API that allows controlling the default value of the stream argument in libcudf APIs.

My feeling is that this should be something that's only allowed to be set exactly once. Changing the default stream multiple times throughout runtime could have unforeseen, nasty side effects.
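One way to enforce set-exactly-once semantics is an atomic flag that rejects any second call — a sketch under the same mocked-stream assumption, not actual libcudf code:

```cpp
#include <atomic>
#include <stdexcept>

using cudaStream_t = void*;  // mocked; the real type comes from the CUDA runtime

namespace cudf {
namespace detail {
inline std::atomic<cudaStream_t> default_stream{nullptr};
inline std::atomic<bool> default_stream_was_set{false};
}  // namespace detail

inline void set_default_stream(cudaStream_t s) {
  // exchange() returns the previous value: true means an earlier call already
  // claimed the one allowed set, so this call is rejected.
  if (detail::default_stream_was_set.exchange(true)) {
    throw std::logic_error("cudf default stream may only be set once");
  }
  detail::default_stream.store(s);
}
}  // namespace cudf
```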

Is the idea for a user to then call something like cudf::setDefaultStream(cudaStreamPerThread) to get PTDS semantics?

Yes.

My feeling is that this should be something that's only allowed to be set exactly once. Changing the default stream multiple times throughout runtime could have unforeseen, nasty side effects.

Can you give an example? I think set_default_stream could call cudaDeviceSynchronize() to alleviate any ill side effects.

Can you give an example?

I'm worried about situations where someone expects the value of get_default_stream() to be the same from one call to the next, e.g., something like this:

void* p = get_current_device_resource()->allocate(n, get_default_stream());
...
set_default_stream(...); // change the default stream from above
...
get_current_device_resource()->deallocate(p, n, get_default_stream()); // Error: free on a different stream than allocated on w/o synchronization

I think set_default_stream could call cudaDeviceSynchronize() to alleviate any ill side effects.

Adding a device sync would probably alleviate any of my concerns.

Something else to consider is whether we should really allow set_default_stream to accept any arbitrary cudaStream_t, or restrict it to a choice between cudaStreamLegacy and cudaStreamPerThread. Otherwise, if it can be any arbitrary stream, that has nearly the same effect as just exposing a cudaStream_t argument in every public API.

I think you are right, we should limit it to cudaStreamLegacy and cudaStreamPerThread (and possibly cudaStreamDefault). In other words, you can have any stream you want as the default as long as it is one of the default streams. :) (Incidentally I hadn't even thought of it as a way to set an arbitrary stream as the default -- really we should only keep references to streams that can't be deleted by the user.)

If multiple threads called it with multiple different non-default streams the resulting confusion could be very difficult to debug.
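Restricting the setter to the built-in default streams could look like this sketch (hypothetical API; the stream handles are mocked with the values the CUDA headers actually use for cudaStreamLegacy and cudaStreamPerThread):

```cpp
#include <stdexcept>

using cudaStream_t = void*;  // mocked opaque handle
inline cudaStream_t const cudaStreamLegacy    = reinterpret_cast<cudaStream_t>(0x1);
inline cudaStream_t const cudaStreamPerThread = reinterpret_cast<cudaStream_t>(0x2);

namespace cudf {
inline cudaStream_t default_stream = cudaStreamLegacy;  // process-wide default

inline void set_default_stream(cudaStream_t s) {
  // Only the two well-known default streams are accepted; anything else is
  // effectively a per-call stream argument and is rejected.
  if (s != cudaStreamLegacy && s != cudaStreamPerThread) {
    throw std::invalid_argument(
        "default stream must be cudaStreamLegacy or cudaStreamPerThread");
  }
  default_stream = s;
}
}  // namespace cudf
```

This also sidesteps the lifetime concern above: the two built-in handles can never be destroyed by the user.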

#6324 is probably more than what we need right now. Are you open to adding the simpler version originally requested?
