Concisely describe the proposed feature
I think we can make the arg memories for a GPU kernel persistent. Moreover, we can have a circular buffer of these memories. These could help pipeline the GPU kernels with fewer synchronization calls.
However, note that this may only be relevant to Metal, since I noticed that the accumulated synchronization overhead in Metal is not small. Not sure if this is the case for CUDA or OpenGL...
The problem
This idea came from profiling mpm_lagrangian_forces.py. I noticed that synchronize() is invoked in each iteration. This is because ti.Tape has two snode write kernels:
Even if these kernels are made asynchronous, with the current implementation, I think we will still have to run a synchronization at each iteration:
Describe the solution you'd like
A natural idea is to turn this arg memory into a circular buffer and sync only when that buffer is full. This may be particularly useful for autodiff tasks: since the autodiff iterations are mostly write-only kernels, do we really have to sync at each iteration?
I quickly hacked together a circular buffer of args for the Metal backend. There was some improvement, but not that much :-/ (Of course, the improvement also depends on the kernel characteristics. For this particular case, the atomic_add()s may also be a hotspot.)
BM settings:
import time

ti.sync()
t = time.time()
steps = 5000
for s in range(steps):
    clear_grid()
    with ti.Tape(total_energy):
        compute_total_energy()
    p2g()
    grid_op()
    g2p()
ti.sync()
duration = time.time() - t
print(f'{steps} steps total={duration} avg={duration / steps}')
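Result (first three runs: baseline, 5002 syncs each; last three runs: circular buffer of args, 716 syncs each):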
5000 steps total=18.08519411087036 avg=0.0036170388221740723
metal Profiler
[100.00%] metal_synchronize min 0.366 ms avg 3.096 ms max 7.920 ms total 15.488 s [ 5002x]
5000 steps total=17.954378843307495 avg=0.003590875768661499
metal Profiler
[100.00%] metal_synchronize min 0.360 ms avg 3.062 ms max 5.403 ms total 15.316 s [ 5002x]
5000 steps total=18.10309410095215 avg=0.00362061882019043
metal Profiler
[100.00%] metal_synchronize min 0.356 ms avg 3.085 ms max 5.408 ms total 15.432 s [ 5002x]
5000 steps total=16.01917791366577 avg=0.003203835582733154
metal Profiler
[100.00%] metal_synchronize min 0.610 ms avg 19.685 ms max 23.812 ms total 14.094 s [ 716x]
5000 steps total=16.738796949386597 avg=0.0033477593898773193
metal Profiler
[100.00%] metal_synchronize min 1.285 ms avg 20.826 ms max 36.587 ms total 14.911 s [ 716x]
5000 steps total=16.424686193466187 avg=0.0032849372386932374
metal Profiler
[100.00%] metal_synchronize min 0.667 ms avg 20.350 ms max 24.623 ms total 14.571 s [ 716x]
So it's about 1.5s faster, or 8%...
Additional comments
Cool!
I can't agree more, the synchronization overhead is ridiculously big on the OpenGL backend... e.g. mpm99's substep() itself takes <100ns, while the argument copy in/out costs almost 200ns... So 70% (compared to 8% on Metal) of the time is spent on memory sync, according to my taichi/perf.h counter. I had no idea how to deal with it, and here comes your solution :)
Sorry about the delay on my end. Yes, using a circular buffer to avoid synchronization is a great idea! Given that this feature is required by all the backends, let's figure out a way to do this in a maintainable manner.
I suggest we create a base class
class DeviceBufferManager {
 protected:
  ...
 public:
  virtual ~DeviceBufferManager() = default;
  virtual void *allocate(std::size_t size) = 0;
  virtual void free(void *ptr) = 0;  // assert first allocate, first free
};
Then we can have DeviceBufferManagerMetal, DeviceBufferManagerOpenGL, and DeviceBufferManagerCUDA that implement this interface for the different backends.
In program/kernel.cpp we can just allocate these buffers through the abstract DeviceBufferManager. What do you guys think?
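To make the "first allocate, first free" contract concrete, here is a minimal sketch of what one implementation could look like. Everything in it is an assumption for illustration: `HostBufferManager` is a hypothetical stand-in backed by `std::malloc` rather than any real Metal, OpenGL, or CUDA API, `num_live()` is a helper added for inspection, and `free` returning `void` is a guess at the intended signature.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <deque>

// Interface from the proposal (free() returning void is an assumption).
class DeviceBufferManager {
 public:
  virtual ~DeviceBufferManager() = default;
  virtual void *allocate(std::size_t size) = 0;
  virtual void free(void *ptr) = 0;  // first allocated, first freed
};

// Hypothetical stand-in that uses host memory instead of a GPU API.
class HostBufferManager : public DeviceBufferManager {
 public:
  ~HostBufferManager() override {
    // Release anything the caller forgot to free.
    for (void *p : live_) std::free(p);
  }

  void *allocate(std::size_t size) override {
    void *p = std::malloc(size);
    assert(p != nullptr);
    live_.push_back(p);  // remember allocation order
    return p;
  }

  void free(void *ptr) override {
    // Enforce the FIFO contract: only the oldest live buffer may be freed.
    assert(!live_.empty() && live_.front() == ptr);
    live_.pop_front();
    std::free(ptr);
  }

  // Hypothetical helper: number of buffers not yet freed.
  std::size_t num_live() const { return live_.size(); }

 private:
  std::deque<void *> live_;  // live buffers, oldest first
};
```

A real backend would replace the `std::malloc`/`std::free` calls with its own buffer creation and destruction, while code in program/kernel.cpp would only see the abstract `DeviceBufferManager`.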
Then, how about allocating using std::vector? allocate and free here can't be called automatically when a constructor/destructor runs. We may want virtual DeviceBuffer allocate(std::size_t size) = 0 instead, where DeviceBuffer has its own destructor and can be stored somewhere in Taichi's internals...
Also notice that some memory, like the root buffer, is never shared between device and host. But the argument buffer is created and destroyed rapidly, so we may want a DeviceBufferPool that doesn't destroy the previously allocated memory when ~DeviceBuffer() is called. I'm currently thinking about how this can improve OpenGL perf...
Then, how about allocating using std::vector? allocate and free here can't be called automatically when a constructor/destructor runs. We may want virtual DeviceBuffer allocate(std::size_t size) = 0 instead, where DeviceBuffer has its own destructor and can be stored somewhere in Taichi's internals...
Yeah, we can also make use of RAII for more safety.
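As a rough illustration of the RAII idea combined with the pool suggestion, here is a minimal sketch. The names `DeviceBufferPool` and `DeviceBuffer` come from the comments above, but every member (`acquire`, `release`, `num_allocations`, `data`) is hypothetical, and host memory via `std::malloc` stands in for a real device allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

// Hypothetical pool: keeps released blocks around instead of destroying them.
class DeviceBufferPool {
 public:
  explicit DeviceBufferPool(std::size_t block_size) : block_size_(block_size) {}
  DeviceBufferPool(const DeviceBufferPool &) = delete;
  DeviceBufferPool &operator=(const DeviceBufferPool &) = delete;
  ~DeviceBufferPool() {
    for (void *p : free_list_) std::free(p);
  }

  // Reuse a cached block if one exists; otherwise allocate a new one.
  void *acquire() {
    if (free_list_.empty()) {
      ++num_allocations_;
      return std::malloc(block_size_);
    }
    void *p = free_list_.back();
    free_list_.pop_back();
    return p;
  }

  // Cache the block for the next acquire() instead of freeing it.
  void release(void *p) { free_list_.push_back(p); }

  int num_allocations() const { return num_allocations_; }

 private:
  std::size_t block_size_;
  std::vector<void *> free_list_;
  int num_allocations_ = 0;
};

// RAII handle: hands its block back to the pool when it goes out of scope.
class DeviceBuffer {
 public:
  explicit DeviceBuffer(DeviceBufferPool &pool)
      : pool_(&pool), ptr_(pool.acquire()) {}
  DeviceBuffer(const DeviceBuffer &) = delete;
  DeviceBuffer &operator=(const DeviceBuffer &) = delete;
  DeviceBuffer(DeviceBuffer &&other) noexcept
      : pool_(other.pool_), ptr_(std::exchange(other.ptr_, nullptr)) {}
  ~DeviceBuffer() {
    if (ptr_ != nullptr) pool_->release(ptr_);
  }

  void *data() const { return ptr_; }

 private:
  DeviceBufferPool *pool_;
  void *ptr_;
};
```

With this shape, creating and destroying an argument buffer on every launch only costs a push/pop on the free list after the first allocation. The pool has to outlive every `DeviceBuffer` created from it.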
Also notice that some memory, like the root buffer, is never shared between device and host. But the argument buffer is created and destroyed rapidly,
Right. I only mentioned the args buffer in the issue precisely because of this: we only need to cache the buffers whose data is shared between host and device.
Given that this feature is required by all the backends, let's figure out a way to do this in a maintainable manner.
Yep. However, it's not the alloc/free that I'm worried about, given that different backends may need different ways of allocation anyway. FYI, what I did in the hack was just pre-allocate a fixed-size array of arg buffers (e.g. 8).
What I do think should be unified is the code pattern below, launch_GPU_kernel(). Assuming we have an ArgsCircularBuffer interface:
void launch_GPU_kernel(Kernel *kernel, ArgsCircularBuffer<Whatever> *cb) {
  auto &args = cb->current_args();
  kernel->bind(root);
  kernel->bind(global_tmps);
  kernel->bind(args);
  kernel->run();
  if (args.has_return_value() || cb->will_be_full()) {
    // The host needs the result, or no free args slot is left.
    synchronize();
    cb->rewind();
  } else {
    // No synchronization required until |cb| is about to be full
    cb->advance_to_next();
  }
}

template <typename GPUArgs, int N = 8>
class ArgsCircularBuffer {
 public:
  GPUArgs &current_args() { return args_[i_]; }
  bool will_be_full() const { return (i_ + 1) >= N; }
  void rewind() { i_ = 0; }
  void advance_to_next() { ++i_; }

 private:
  GPUArgs args_[N];
  int i_ = 0;
};
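To get a feel for how many synchronizations this pattern saves, here is a small host-only sketch that replays the control flow of launch_GPU_kernel() and counts the syncs. `FakeArgs` and `count_syncs` are hypothetical names made up for this example, no GPU work is launched, and `has_return_value` is modeled as a plain field rather than a method.

```cpp
#include <cassert>

// Hypothetical args type; a real one would wrap a backend buffer.
struct FakeArgs {
  bool has_return_value = false;
};

template <typename GPUArgs, int N = 8>
class ArgsCircularBuffer {
 public:
  GPUArgs &current_args() { return args_[i_]; }
  bool will_be_full() const { return (i_ + 1) >= N; }
  void rewind() { i_ = 0; }
  void advance_to_next() { ++i_; }

 private:
  GPUArgs args_[N];
  int i_ = 0;
};

// Replays the launch pattern num_launches times and returns how many
// times synchronize() would have been called.
template <int N>
int count_syncs(int num_launches, bool returns_value) {
  ArgsCircularBuffer<FakeArgs, N> cb;
  int syncs = 0;
  for (int k = 0; k < num_launches; ++k) {
    FakeArgs &args = cb.current_args();
    args.has_return_value = returns_value;
    // (binding the buffers and running the kernel would happen here)
    if (args.has_return_value || cb.will_be_full()) {
      ++syncs;  // stands in for synchronize()
      cb.rewind();
    } else {
      cb.advance_to_next();
    }
  }
  return syncs;
}
```

Under these assumptions, 16 launches of a kernel without a return value and N = 8 trigger 2 syncs, while a kernel that returns a value still syncs on every launch. Launches left in a partially filled buffer are not synced by this loop, so the caller still needs a final synchronize() before reading results on the host.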