Systematically resolving this issue is non-trivial. Maybe it's a good chance to add IR support for thread-local storage and scratchpad (shared) memory extension as well.
Here's a good presentation on optimizing reductions on gpu :)
gpu_reduction.pdf
@yuanming-hu Thank you for creating this issue!
One particularly important case of reduction happens on param gradient accumulation during backprop. Consider the code:
import taichi as ti
import math
ti.reset()
ti.init(arch=ti.cuda)
N = 2**20
a = ti.Matrix(4, 4, dt=ti.f32, shape=(), needs_grad=False)
b = ti.Matrix(4, 4, dt=ti.f32, shape=(), needs_grad=True)
x = ti.Vector(4, dt=ti.f32, shape=N, needs_grad=True)
y = ti.Vector(4, dt=ti.f32, shape=N)
@ti.kernel
def f(w: ti.template()):
for p in ti.grouped(x):
y[p] = w@x[p]
Running the below benchmarks gives the following results (after the warm-up run):
%%time
for i in range(1000): f(a)
ti.sync()
# CPU times: user 98.4 ms, sys: 44.7 ms, total: 143 ms
# Wall time: 146 ms
Nice!
%%time
for i in range(1000): f.grad(a)
ti.sync()
# CPU times: user 240 ms, sys: 165 ms, total: 405 ms
# Wall time: 405 ms
Why is grad kernel slower? (although still pretty fast)
%%time
for i in range(1000): f.grad(b)
ti.sync()
# CPU times: user 18 s, sys: 12.6 s, total: 30.6 s
# Wall time: 30.4 s
Accumulating grad to b kills the performance.
Thanks for pointing this out! The benchmark is very meaningful. On CPU, the atomic_adds are really slow, since we have to implement in it using atomicCAS + while loop. On GPU, hardware atomics are faster but atomic contention is still an issue.
Of course, the systematic solution is to add thread-local storage/shared memory to the IR, which I believe I'll have some time for in May...
As a temporary solution I managed to get a decent speed-up by reducing the contention with a number of copies of param matrix.
N = 2**20
D = 256
w = ti.Matrix(4, 4, dt=ti.f32, shape=(D,), needs_grad=True)
x = ti.Vector(4, dt=ti.f32, shape=N, needs_grad=True)
y = ti.Vector(4, dt=ti.f32, shape=N)
@ti.kernel
def f():
for i in x:
y[i] = w[0]@x[i]
@ti.kernel
def f_strided():
for i in x:
y[i] = w[i%D]@x[i]
%%time
for i in range(1000): f.grad()
ti.sync()
# CPU times: user 18.3 s, sys: 11.9 s, total: 30.2 s
# Wall time: 30.1 s
%%time
for i in range(1000): f_strided.grad()
ti.sync()
# CPU times: user 776 ms, sys: 462 ms, total: 1.24 s
# Wall time: 1.24 s
Forgot to mention: benchmarks were run on Nvidia P100 CUDA
An update from the Metal side when using SIMD reductions (equivalent to CUDA warp-level reductions).
There's no obvious perf difference when doing global reduction on integer atomics types. For float types, it was about 49x faster (for the particular case I benchmarked). I guess the current atomic add impl for floats are not that efficient:
The BM case is to sum 65536 floats. Here's the results:
0.0531s0.00109sVery cool!
There's no obvious perf difference when doing global reduction on integer atomics types.
One possibility is that the Metal compiler already does the optimization for you on integer, since integer operations are associative. Note that the compiler is not allowed to change float-point operation order unless you use --fastmath, so programmers have to specify SIMD reduction manually.