We are adding Per-Thread Default Stream (PTDS) support to RMM memory resources, and we need to be confident that libcudf functions correctly in this mode. This issue serves as a place to discuss testing requirements.
In rapidsai/rmm#425 I am currently adding PTDS support to pool_memory_resource. For testing, I basically started from the single-threaded RMM memory_resource tests and wrapped them in a loop like this:
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t num_threads{4};

// Run `task(args...)` concurrently on num_threads threads and join them all.
template <typename Task, typename... Arguments>
void spawn(Task task, Arguments... args)
{
  std::vector<std::thread> threads;
  threads.reserve(num_threads);
  for (std::size_t i = 0; i < num_threads; ++i)
    threads.emplace_back(task, args...);
  for (auto& t : threads)
    t.join();
}
Here Task is the function that implements the single-threaded test. Now on the surface, this is just embarrassingly parallel testing -- there are no explicit dependencies between these threads or their default streams. But inside of the memory resource there are data structures that are shared (free lists, etc). So this type of testing should be sufficient for RMM.
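For concreteness, here is a minimal sketch (not the actual RMM tests) of how an existing single-threaded test body might be wrapped with spawn; run_single_threaded_test is a hypothetical placeholder that exercises allocation on the calling thread's default stream:

#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical stand-in for an existing single-threaded memory_resource test.
// With PTDS enabled, the async work below is issued on the calling thread's
// own default stream rather than the single legacy default stream.
void run_single_threaded_test(std::size_t num_iterations)
{
  for (std::size_t i = 0; i < num_iterations; ++i) {
    void* p = nullptr;
    cudaMalloc(&p, 1024);
    cudaMemsetAsync(p, 0, 1024);  // default stream of this thread
    cudaFree(p);                  // synchronizes the device before freeing
  }
}

int main()
{
  // Embarrassingly parallel fork-join: every thread runs the same test body,
  // reusing the spawn() helper defined above.
  spawn(run_single_threaded_test, std::size_t{100});
  return 0;
}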
The question is, is simple parallel fork-join testing like this sufficient for libcudf unit testing? I would argue that it probably is, because the common use case will be to call libcudf APIs in independent threads. I believe that anything that, for example, accesses a single table simultaneously from multiple threads/streams will either be
a) internal to libcudf with explicit streams (not PTDS streams) and therefore already tested, or
b) external to libcudf and therefore should be tested externally to libcudf or in some sort of integration test.
But I wanted to open this point up for debate.
Also, even adding this type of simple loop around existing tests is probably nontrivial and would require significant test refactoring, so that is worth discussing as well.
And of course the above also applies to pytests -- do we need additional multithreaded pytests?
CC @rongou @jrhemstad @kkraus14 @sameerz
I wouldn't put a lot of faith in using Python for test coverage here. The GIL can easily end up hiding race conditions (especially if from the libcudf side we're returning once the kernel launch is enqueued on the stream).
We could, for example, run all of the existing dask-cudf unit tests with per-thread default stream in a threaded scheduler, though I'm just not sure how much value that really adds.
I don't think libcudf should provide _any_ multi-threaded testing.
Multi-threaded testing is wholly outside the sphere of libcudf's concern. libcudf is a single-threaded, single-GPU library, and is agnostic of anything to do with multi-threaded, distributed, multi-GPU use cases. There is no mutable, shared, global state that we have to worry about when multiple threads use libcudf concurrently.
In contrast, RMM _does_ need to provide multi-threaded testing as there is global, shared state among threads using the same resource that we have to make sure is free of race conditions.
We should make sure our current (single-threaded) tests work in PTDS mode as the semantics of the default stream change (no longer implicitly syncs with other streams). However, we should not need to perform tests exercising libcudf concurrently among multiple threads.
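To make that semantic change concrete, here is a small standalone sketch (my illustration, not a cudf test). Compiled normally, the default-stream copy below implicitly waits for work on the blocking stream s; compiled with --default-stream per-thread, it does not, and the commented-out explicit synchronization becomes necessary:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  cudaStream_t s;
  cudaStreamCreate(&s);  // a "blocking" stream: synchronizes with the legacy default stream

  int* d = nullptr;
  cudaMalloc(&d, sizeof(int));
  cudaMemsetAsync(d, 0xFF, sizeof(int), s);  // work enqueued on stream s

  // Legacy default stream: the copy below waits for the memset on s.
  // Per-thread default stream: no implicit ordering with s, so this explicit
  // synchronization (or an event) is required for a deterministic result.
  // cudaStreamSynchronize(s);

  int h = 0;
  cudaMemcpyAsync(&h, d, sizeof(int), cudaMemcpyDeviceToHost);  // default stream
  cudaDeviceSynchronize();
  std::printf("h = 0x%08x\n", h);

  cudaFree(d);
  cudaStreamDestroy(s);
  return 0;
}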
There is no mutable, shared, global state that we have to worry about when multiple threads use libcudf concurrently.
Is this still true with things like GroupBy? What about column_view null count?
Is this still true with things like GroupBy? What about column_view null count?
How is this any different than if a user is using libcudf from multiple threads today? All PTDS does is make it so each thread gets its own independent default stream.
If a user is modifying the _same_ libcudf objects from different threads then that is entirely their responsibility to make sure they are doing the right thing.
libcudf doesn't provide any multi-threaded testing today because libcudf doesn't care. PTDS doesn't change that.
So is our stance that any stateful libcudf objects aren't thread safe? You could imagine someone wanting to create a GroupBy object and then spew off a bunch of different aggregations against different columns across different streams.
So is our stance that any stateful libcudf objects aren't thread safe? You could imagine someone wanting to create a GroupBy object and then spew off a bunch of different aggregations against different columns across different streams.
Yes, without question. There is zero guarantee of thread safety when concurrently modifying the state of objects such as a column or groupby object.
This is no different than the fact that concurrently modifying a std::vector from multiple threads ~isn't safe~. (This is imprecise. There are certain situations where it is safe to modify a std::vector concurrently, e.g., you can safely modify v[i] and v[j] when i != j without a race. I was speaking more about operations that modify the state of the vector, such as concurrent push_back operations).
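As a small illustration of the distinction (my sketch): concurrent writes to distinct elements are fine, while concurrent push_back calls race on the vector's size and storage, just as concurrent mutation of a libcudf column or groupby object would.

#include <thread>
#include <vector>

int main()
{
  std::vector<int> v(2, 0);

  // Safe: the two threads modify *different* elements and never touch the
  // vector's own state (size, capacity, data pointer).
  std::thread a([&] { v[0] = 1; });
  std::thread b([&] { v[1] = 2; });
  a.join();
  b.join();

  // Unsafe (data race, left commented out): concurrent push_back mutates the
  // size and may reallocate the storage both threads are using.
  // std::thread c([&] { v.push_back(3); });
  // std::thread d([&] { v.push_back(4); });

  return 0;
}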
@revans2 @jlowe for perspective on spark internals and the plugin.
Mostly agree with @jrhemstad that we don't need to explicitly test libcudf with multiple threads, as Spark relies on task parallelism so it shouldn't need to access a cudf table concurrently. The spilling work @jlowe has been doing might make this more complicated, but I'll let him comment on that.
I do think we need to build the current code/tests with PTDS enabled, perhaps as part of the CI pipeline, to account for the subtle difference between per-thread and legacy default streams. This is similar to what Thrust is planning to do: https://github.com/thrust/thrust/issues/1132#issuecomment-647682961. If that causes too much overhead with the CI, maybe we should consider having PTDS always enabled.
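For reference, this is the standard CUDA mechanism for enabling PTDS at build time (general CUDA usage, not the specific cudf CMake wiring): device code is compiled with nvcc's --default-stream per-thread flag, and host translation units not compiled by nvcc opt in by defining a macro before including the CUDA headers:

// Host-compiled translation unit: define this before any CUDA header so that
// runtime API calls taking the default stream (stream 0) resolve to the
// per-thread default stream. (nvcc-compiled code uses --default-stream
// per-thread instead; the macro has no effect there.)
#define CUDA_API_PER_THREAD_DEFAULT_STREAM 1
#include <cuda_runtime.h>

int main()
{
  int* d = nullptr;
  cudaMalloc(&d, sizeof(int));
  cudaMemsetAsync(d, 0, sizeof(int));  // issued on this thread's default stream
  cudaFree(d);
  return 0;
}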
For RMM, we may want to test for memory allocated in one thread and freed in another, as it happens in Spark (@revans2 @jlowe can you confirm?).
For RMM, we may want to test for memory allocated in one thread and freed in another, as it happens in Spark (@revans2 @jlowe can you confirm?).
I can write separate RMM tests for this as part of my current PR.
@jrhemstad and I have been discussing offline, and @rongou's point about the cost of CI is a good one. Testing everything with PTDS would ~double CI cost. The good news is that we believe that PTDS is a strict superset of non-PTDS (legacy default stream). In other words, if tests pass in PTDS, then they should pass in non-PTDS. However the converse is not true -- tests may fail in PTDS that do not fail with legacy default stream.
So I propose that once RMM #425 is done, we switch CI to testing with PTDS enabled (even though it is not the default).
So I propose that once RMM #425 is done, we switch CI to testing with PTDS enabled (even though it is not the default).
We can't just flip that switch unless we bring the ecosystem along with us. This would likely break at least cuGraph and cuSpatial at the C++ level, and poses huge interoperability challenges for cuDF Python with Numba, CuPy, PyArrow, PyTorch, etc. at the Python level.
I think you misunderstand. We would only be flipping a switch to build and test with PTDS enabled. PTDS would still remain off by default in packages.
Assuming we are correct that PTDS is a strict superset (in terms of stream "safety") of legacy default stream mode, then testing PTDS is equivalent to testing both PTDS and legacy default stream mode.
The alternative is ~4-hour CI times for every push.
We would only be flipping a switch to build and test with PTDS enabled. PTDS would still remain off by default in packages.
Yes, and testing the cuDF Python code with PTDS enabled is likely not feasible right now because cuDF Python uses CuPy / Numba internally. Because of that, we'd need to build libcudf both with and without PTDS in order to run the C++ unit tests with PTDS and the cuDF Python unit tests without PTDS.
Since libcudf does not currently have a way to specify streams to its public APIs, it may be fine to combine PTDS-compiled libcudf with legacy-default-stream-compiled cupy/numba. We can try.
Sounds good to fire off a PR changing the CI GPU script to use a PTDS build to try it. If things don't work as expected I think we may be able to use Cython in combination with the cudaStreamPerThread (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g7b7129befd6f52708309acafd1c46197) definition to get an explicit stream to pass to our Numba / CuPy calls internally, but that won't necessarily be trivial.
I believe that combining a legacy-default stream library with a PTDS library in a process will only increase the amount of synchronization relative to the PTDS library alone. So there should be no need for changing the program (tests).
For RMM, we may want to test for memory allocated in one thread and freed in another, as it happens in Spark (@revans2 @jlowe can you confirm?).
Yes, this happens during the shuffle phase when task outputs are cached in GPU memory. A task thread allocates memory, writes the task output, then the thread returns to a thread pool to run another task. Sometime later another thread will come along and free that memory.
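A minimal sketch of that allocate-in-one-thread, free-in-another pattern (plain CUDA runtime calls stand in here for an RMM pool resource; thread names are illustrative):

#include <cuda_runtime.h>
#include <thread>

int main()
{
  void* task_output = nullptr;

  // Task thread: allocate and write the task output on its own default stream,
  // then return to the thread pool (here the thread simply exits).
  std::thread producer([&] {
    cudaMalloc(&task_output, 1 << 20);
    cudaMemsetAsync(task_output, 0, 1 << 20);  // this thread's default stream under PTDS
    cudaDeviceSynchronize();                   // ensure the write is complete before handoff
  });
  producer.join();

  // Sometime later, a different thread frees the same allocation.
  std::thread consumer([&] { cudaFree(task_output); });
  consumer.join();

  return 0;
}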
@jlowe @rongou tests for this added in rapidsai/rmm#425, please review: https://github.com/rapidsai/rmm/blob/26c2909e2c34aa362b55e9b4e51e4b7b7ff4d8eb/tests/mr/device/mr_multithreaded_tests.cpp#L297-L325
@rongou
Does spark-rapids need any changes when the underlying cuDF & RMM are built with per-thread default stream enabled?
In our environment, RMM & cuDF were built with ptds=on and spark-rapids (version 0.1) was left unchanged. Everything seems to go well when we run TPC-DS query3, but since Nsight Systems does not work for us at the moment, we are not sure whether PTDS is really taking effect.
By the way, Thrust's default_stream() function was changed in CUDA 10.2.
The spark-rapids plugin itself doesn't need to change, but the cuDF jar should also be built with PTDS enabled. See https://github.com/rapidsai/cudf/blob/branch-0.15/java/README.md#per-thread-default-stream.
The Thrust change is incorporated into RMM/cuDF branch-0.15.
To generate a profile, this is what I do:
nvprof -o %p.prof --profile-child-processes --print-api-trace bash -c "\
/opt/spark/sbin/start-slave.sh spark://you-spark-master:7077; \
./your-job-script.sh; \
/opt/spark/sbin/stop-slave.sh"
This will create a profile named with the child process id (e.g. 10046.prof), which you can load into nvvp to see if the legacy default stream was still used.
The spark-rapids plugin itself doesn't need to change, but the cuDF jar should also be built with PTDS enabled. See https://github.com/rapidsai/cudf/blob/branch-0.15/java/README.md#per-thread-default-stream.
The Thrust change is incorporated into RMM/cuDF branch-0.15.
To generate a profile, this is what I do:
- Set up a Spark standalone cluster with GPU enabled.
- On one node, stop the slave, and run the following:
nvprof -o %p.prof --profile-child-processes --print-api-trace bash -c "\
/opt/spark/sbin/start-slave.sh spark://you-spark-master:7077; \
./your-job-script.sh; \
/opt/spark/sbin/stop-slave.sh"
This will create a profile named with the child process id (e.g. 10046.prof), which you can load into nvvp to see if the legacy default stream was still used.
nvprof/nvvp are deprecated. They won't work at all on a T4 system. Nsight Systems is the replacement.
The spark-rapids plugin itself doesn't need to change
If the RAPIDS Accelerator plugin is configured to use the UCX shuffle feature then changes to the plugin may be necessary. In that scenario shuffle output partitions are cached in GPU memory. With PTDS enabled, a GPU buffer will be produced by an upstream stage task thread on its own stream then consumed by a downstream stage task thread using a different stream. Without proper synchronization between those two streams, the downstream task theoretically could start consuming the partition data before kernels on the upstream stage task's stream have completed processing.
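A hedged sketch of the kind of synchronization described above, using a CUDA event so the consumer's stream waits for the producer's stream on the device (the two created streams stand in for the two task threads' per-thread default streams; names are illustrative, not the plugin's actual code):

#include <cuda_runtime.h>

int main()
{
  cudaStream_t producer_stream, consumer_stream;
  cudaStreamCreate(&producer_stream);
  cudaStreamCreate(&consumer_stream);

  cudaEvent_t ready_event;
  cudaEventCreateWithFlags(&ready_event, cudaEventDisableTiming);

  void* partition_buffer = nullptr;
  cudaMalloc(&partition_buffer, 1 << 20);

  // Upstream stage task: produce the shuffle partition on its own stream, then
  // record an event marking completion of that work.
  cudaMemsetAsync(partition_buffer, 0, 1 << 20, producer_stream);
  cudaEventRecord(ready_event, producer_stream);

  // Downstream stage task: make its stream wait on the event before enqueueing
  // anything that reads the partition. The host thread does not block here.
  cudaStreamWaitEvent(consumer_stream, ready_event, 0);
  // ... kernels/copies consuming partition_buffer would be enqueued on consumer_stream ...

  cudaDeviceSynchronize();
  cudaFree(partition_buffer);
  cudaEventDestroy(ready_event);
  cudaStreamDestroy(producer_stream);
  cudaStreamDestroy(consumer_stream);
  return 0;
}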
Nsight Systems works similarly:
nsys profile bash -c "\
/opt/spark/sbin/start-slave.sh spark://you-spark-master:7077; \
./your-job-script.sh; \
/opt/spark/sbin/stop-slave.sh"