Taichi: Have a circular buffer of args memories for GPU kernels

Created on 27 Feb 2020 · 6 Comments · Source: taichi-dev/taichi

Concisely describe the proposed feature

I think we can make the arg memories for a GPU kernel persistent. Moreover, we can have a circular buffer of these memories. These could help pipeline the GPU kernels with fewer synchronization calls.

However, note that this may be only relevant to Metal, since I noticed that the accumulated synchronization overhead in Metal is not small. Not sure if this is the case for CUDA or OpenGL...

The problem

This idea came from profiling mpm_lagrangian_forces.py. I noticed that synchronize() is invoked in each iteration. This is because ti.Tape has two SNode write kernels:

https://github.com/taichi-dev/taichi/blob/7bbff2c6d07633108cdf5314d28fe7c112b8da59/python/taichi/lang/__init__.py#L161-L162

Even if these kernels are made asynchronous, with the current implementation, I think we will still have to run a synchronization at each iteration:

Describe the solution you'd like

A natural idea is to turn this arg memory into a circular buffer, and sync only when that buffer is full. This may be particularly useful for autodiff tasks -- Since the autodiff iterations are mostly write-only kernels, we don't really have to sync at each iteration?

I quickly hacked together a circular buffer of args for the Metal backend. There was some improvement, but not that much :-/ (Of course, the improvement also depends on the kernel characteristics. For this particular case, the atomic_add()s may also be a hotspot.)

Benchmark settings:

  import time
  ti.sync()
  t = time.time()
  steps = 5000
  for s in range(steps):
    clear_grid()
    with ti.Tape(total_energy):
      compute_total_energy()
    p2g()
    grid_op()
    g2p()
  ti.sync()
  duration = time.time() - t
  print(f'{steps} steps total={duration} avg={duration / steps}')

Result:

  • No change
5000 steps total=18.08519411087036 avg=0.0036170388221740723
metal Profiler
[100.00%] metal_synchronize                            min   0.366 ms   avg   3.096 ms    max   7.920 ms   total  15.488 s [   5002x]

5000 steps total=17.954378843307495 avg=0.003590875768661499
metal Profiler
[100.00%] metal_synchronize                            min   0.360 ms   avg   3.062 ms    max   5.403 ms   total  15.316 s [   5002x]

5000 steps total=18.10309410095215 avg=0.00362061882019043
metal Profiler
[100.00%] metal_synchronize                            min   0.356 ms   avg   3.085 ms    max   5.408 ms   total  15.432 s [   5002x]
  • circular_buffer=8
5000 steps total=16.01917791366577 avg=0.003203835582733154
metal Profiler
[100.00%] metal_synchronize                            min   0.610 ms   avg  19.685 ms    max  23.812 ms   total  14.094 s [    716x]

5000 steps total=16.738796949386597 avg=0.0033477593898773193
metal Profiler
[100.00%] metal_synchronize                            min   1.285 ms   avg  20.826 ms    max  36.587 ms   total  14.911 s [    716x]

5000 steps total=16.424686193466187 avg=0.0032849372386932374
metal Profiler
[100.00%] metal_synchronize                            min   0.667 ms   avg  20.350 ms    max  24.623 ms   total  14.571 s [    716x]

So it's about 1.5s faster, or 8%...
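As a quick check of that estimate, here is a small sketch that averages the three totals reported for each configuration and compares the means. The helper names (`mean`, `seconds_saved`, `relative_saving`) are made up for illustration; only the six totals come from the results above. Averaged this way, the saving comes out at roughly 1.65 s per 5000 steps, or about 9% of the baseline, which is in the same ballpark as the rough figure quoted.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Arithmetic mean of a non-empty list of wall-clock totals (seconds).
double mean(const std::vector<double>& xs) {
  assert(!xs.empty());
  return std::accumulate(xs.begin(), xs.end(), 0.0) / xs.size();
}

// Totals copied from the three runs of each configuration listed above.
const std::vector<double> kNoChange = {18.08519411087036, 17.954378843307495,
                                       18.10309410095215};
const std::vector<double> kCircular8 = {16.01917791366577, 16.738796949386597,
                                        16.424686193466187};

// Seconds saved per 5000 steps, comparing the mean totals.
double seconds_saved() { return mean(kNoChange) - mean(kCircular8); }

// The same saving expressed as a fraction of the baseline mean.
double relative_saving() { return seconds_saved() / mean(kNoChange); }
```

Note that the profiler lines show the number of metal_synchronize calls dropping from 5002 to 716, while the total time spent inside them only drops by about a second, so each remaining sync waits longer.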


enhancement metal

All 6 comments

Cool!
I couldn't agree more: synchronization overhead is ridiculously big on the OpenGL backend. E.g. mpm99's substep() itself takes <100ns, while the argument copy in/out costs almost 200ns. So about 70% of the time (compared to 8% on Metal) is spent on memory sync, according to my taichi/perf.h counter. I had no idea how to deal with it, and here comes your solution :)

Sorry about the delay on my end. Yes, using a circular buffer to avoid synchronization is a great idea! Given that this feature is required by all the backends, let's figure out a way to do this in a maintainable manner.

I suggest we create a base class

class DeviceBufferManager {
protected:
  ...

public:
  virtual ~DeviceBufferManager() = default;
  virtual void *allocate(std::size_t size) = 0;
  // Buffers must be freed in the order they were allocated (FIFO)
  virtual void free(void *ptr) = 0;
};

Then we can have DeviceBufferManagerMetal, DeviceBufferManagerOpenGL, and DeviceBufferManagerCUDA that implement this interface for the different backends.

In program/kernel.cpp we can just allocate these buffers through the abstract DeviceBufferManager. What do you guys think?

Then, how about allocating with std::vector? allocate and free here can't be called automatically when the constructor/destructor runs. We may want virtual DeviceBuffer allocate(std::size_t size) = 0 instead, where DeviceBuffer has its own destructor, and can be stored in taichi's internals somehow...

Also notice that some memories like the root buffer are never shared between device and host. But the argument buffer is created and destroyed rapidly, so we may want a DeviceBufferPool that doesn't destroy the memory allocated last time when ~DeviceBuffer() is called. I'm currently thinking about how this can improve OpenGL perf..
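To make the RAII-plus-pool idea concrete, here is a minimal sketch under some loud assumptions: plain host memory (`new char[]`) stands in for a backend-specific device allocation, and the names `recycle`, `num_cached` and `num_backend_allocations` are hypothetical, not part of Taichi. It only illustrates that ~DeviceBuffer() can hand memory back to a DeviceBufferPool rather than release it, so that a later allocate() reuses it.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

class DeviceBufferPool;

// RAII handle: on destruction the memory goes back to the pool, not the backend.
class DeviceBuffer {
 public:
  DeviceBuffer(DeviceBufferPool* pool, std::unique_ptr<char[]> mem, std::size_t size)
      : pool_(pool), mem_(std::move(mem)), size_(size) {}
  DeviceBuffer(DeviceBuffer&& other) noexcept
      : pool_(other.pool_), mem_(std::move(other.mem_)), size_(other.size_) {
    other.pool_ = nullptr;  // the moved-from handle no longer owns anything
  }
  DeviceBuffer(const DeviceBuffer&) = delete;
  DeviceBuffer& operator=(const DeviceBuffer&) = delete;
  ~DeviceBuffer();

  void* data() const { return mem_.get(); }
  std::size_t size() const { return size_; }  // capacity of the underlying block

 private:
  DeviceBufferPool* pool_;  // must outlive every buffer it hands out
  std::unique_ptr<char[]> mem_;
  std::size_t size_;
};

class DeviceBufferPool {
 public:
  DeviceBuffer allocate(std::size_t size) {
    // Reuse the first cached block that is large enough.
    for (std::size_t k = 0; k < free_.size(); ++k) {
      if (free_[k].size >= size) {
        Block b = std::move(free_[k]);
        free_.erase(free_.begin() + static_cast<std::ptrdiff_t>(k));
        return DeviceBuffer(this, std::move(b.mem), b.size);
      }
    }
    // Otherwise fall back to a fresh allocation (host memory as a stand-in).
    ++backend_allocations_;
    return DeviceBuffer(this, std::unique_ptr<char[]>(new char[size]), size);
  }
  std::size_t num_cached() const { return free_.size(); }
  std::size_t num_backend_allocations() const { return backend_allocations_; }

 private:
  friend class DeviceBuffer;
  struct Block {
    std::unique_ptr<char[]> mem;
    std::size_t size;
  };
  void recycle(std::unique_ptr<char[]> mem, std::size_t size) {
    assert(mem != nullptr);
    free_.push_back(Block{std::move(mem), size});
  }
  std::vector<Block> free_;
  std::size_t backend_allocations_ = 0;
};

inline DeviceBuffer::~DeviceBuffer() {
  if (pool_ != nullptr && mem_ != nullptr) pool_->recycle(std::move(mem_), size_);
}
```

With this shape, allocating a 64-byte buffer, letting it go out of scope, and then allocating 32 bytes would hit the backend only once. A real backend would additionally have to make sure the GPU is done with a block before it is handed out again, which this sketch ignores.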

Then, how about allocating with std::vector? allocate and free here can't be called automatically when the constructor/destructor runs. We may want virtual DeviceBuffer allocate(std::size_t size) = 0 instead, where DeviceBuffer has its own destructor, and can be stored in taichi's internals somehow...

Yeah, we can also make use of RAII for more safety.

Also notice that some memories like the root buffer are never shared between device and host. But the argument buffer is created and destroyed rapidly,

Right. Note that I only mentioned the args buffer in the issue precisely because of this: we only need to cache the buffers whose data will be shared between host and device.

Given that this feature is required by all the backends, let's figure out a way to do this in a maintainable manner.

Yep. However, it's not the alloc/free that I'm worried about, given that different backends may need different ways of allocation anyway. FYI what I did in the hack was just pre-allocate an array with a fixed number of arg buffers (e.g. 8).

What I do think should be unified is the code pattern below, launch_GPU_kernel(). Assuming we have an ArgsCircularBuffer interface:

void launch_GPU_kernel(Kernel* kernel, ArgsCircularBuffer<Whatever>* cb) {
  // current_args() returns a reference, so bind it by reference
  auto& args = cb->current_args();
  kernel->bind(root);
  kernel->bind(global_tmps);
  kernel->bind(args);
  kernel->run();
  if (args.has_return_value() || cb->will_be_full()) {
    // The host needs the result, or no free slot remains: wait for the GPU
    synchronize();
    cb->rewind();
  } else {
    // No synchronization required until |cb| is about to be full
    cb->advance_to_next();
  }
}

template <typename GPUArgs, int N = 8>
class ArgsCircularBuffer {
 public:
  GPUArgs& current_args() { return args_[i_]; }
  bool will_be_full() const { return (i_ + 1) >= N; }
  void rewind() { i_ = 0; }
  void advance_to_next() { ++i_; }
 private:
  GPUArgs args_[N];  // N pre-allocated argument buffers
  int i_ = 0;        // index of the slot used by the next launch
};
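
To see how often this pattern would actually synchronize, here is a self-contained sketch that replays a sequence of launches and counts the syncs. It repeats the ArgsCircularBuffer class so that it compiles on its own. `FakeArgs` and `count_syncs` are hypothetical stand-ins, and the binding and kernel launch are left out, so this models only the bookkeeping, not a real backend.

```cpp
#include <cassert>

// Hypothetical stand-in for a backend's argument buffer.
struct FakeArgs {
  bool has_return_value = false;
};

template <typename GPUArgs, int N = 8>
class ArgsCircularBuffer {
 public:
  GPUArgs& current_args() {
    assert(i_ < N);
    return args_[i_];
  }
  bool will_be_full() const { return (i_ + 1) >= N; }
  void rewind() { i_ = 0; }
  void advance_to_next() { ++i_; }

 private:
  GPUArgs args_[N];
  int i_ = 0;
};

// Replays `num_launches` launches and returns how many synchronizations the
// pattern issues. If `return_every` is non-zero, every `return_every`-th
// launch is marked as returning a value, which forces a sync.
template <int N>
int count_syncs(int num_launches, int return_every = 0) {
  ArgsCircularBuffer<FakeArgs, N> cb;
  int syncs = 0;
  for (int k = 1; k <= num_launches; ++k) {
    FakeArgs& args = cb.current_args();
    args.has_return_value = (return_every > 0) && (k % return_every == 0);
    // (a real backend would bind the buffers and enqueue the kernel here)
    if (args.has_return_value || cb.will_be_full()) {
      ++syncs;  // stands in for synchronize()
      cb.rewind();
    } else {
      cb.advance_to_next();
    }
  }
  return syncs;
}
```

Under these assumptions, `count_syncs<8>(5000)` yields 625 (one sync per 8 launches), while `count_syncs<1>(5000)` degenerates to one sync per launch. Kernels with return values bring the count back up, since each of them forces a sync regardless of how full the buffer is.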