Taichi: [TLS] Thread local storage for optimized reduction

Created on 10 Mar 2020 · 7Comments · Source: taichi-dev/taichi

Systematically resolving this issue is non-trivial. Maybe it's a good chance to add IR support for thread-local storage and scratchpad (shared) memory extension as well.

feature request

Source

yuanming-hu

👀2 🎉1

All 7 comments

Here's a good presentation on optimizing reductions on gpu :)
gpu_reduction.pdf

KLozes on 11 Mar 2020

@yuanming-hu Thank you for creating this issue!

One particularly important case of reduction happens on param gradient accumulation during backprop. Consider the code:

import taichi as ti
import math

ti.reset()
ti.init(arch=ti.cuda)

N = 2**20
a = ti.Matrix(4, 4, dt=ti.f32, shape=(), needs_grad=False)
b = ti.Matrix(4, 4, dt=ti.f32, shape=(), needs_grad=True)
x = ti.Vector(4, dt=ti.f32, shape=N, needs_grad=True)
y = ti.Vector(4, dt=ti.f32, shape=N)
@ti.kernel
def f(w: ti.template()):
  for p in ti.grouped(x):
    y[p] = w@x[p]

Running the below benchmarks gives the following results (after the warm-up run):

%%time
for i in range(1000): f(a)
ti.sync()
# CPU times: user 98.4 ms, sys: 44.7 ms, total: 143 ms
# Wall time: 146 ms

Nice!

%%time
for i in range(1000): f.grad(a)
ti.sync()
# CPU times: user 240 ms, sys: 165 ms, total: 405 ms
# Wall time: 405 ms

Why is grad kernel slower? (although still pretty fast)

%%time
for i in range(1000): f.grad(b)
ti.sync()
# CPU times: user 18 s, sys: 12.6 s, total: 30.6 s
# Wall time: 30.4 s

Accumulating grad to b kills the performance.

znah on 25 Apr 2020

👍1

Thanks for pointing this out! The benchmark is very meaningful. On CPU, the atomic_adds are really slow, since we have to implement in it using atomicCAS + while loop. On GPU, hardware atomics are faster but atomic contention is still an issue.

Of course, the systematic solution is to add thread-local storage/shared memory to the IR, which I believe I'll have some time for in May...

yuanming-hu on 26 Apr 2020

👍1

As a temporary solution I managed to get a decent speed-up by reducing the contention with a number of copies of param matrix.

N = 2**20
D = 256
w = ti.Matrix(4, 4, dt=ti.f32, shape=(D,), needs_grad=True)
x = ti.Vector(4, dt=ti.f32, shape=N, needs_grad=True)
y = ti.Vector(4, dt=ti.f32, shape=N)

@ti.kernel
def f():
  for i in x:
    y[i] = w[0]@x[i]

@ti.kernel
def f_strided():
  for i in x:
    y[i] = w[i%D]@x[i]


%%time
for i in range(1000): f.grad()
ti.sync()
# CPU times: user 18.3 s, sys: 11.9 s, total: 30.2 s
# Wall time: 30.1 s

%%time
for i in range(1000): f_strided.grad()
ti.sync()
# CPU times: user 776 ms, sys: 462 ms, total: 1.24 s
# Wall time: 1.24 s

znah on 26 Apr 2020

Forgot to mention: benchmarks were run on Nvidia P100 CUDA

znah on 26 Apr 2020

👍1

An update from the Metal side when using SIMD reductions (equivalent to CUDA warp-level reductions).

There's no obvious perf difference when doing global reduction on integer atomics types. For float types, it was about 49x faster (for the particular case I benchmarked). I guess the current atomic add impl for floats are not that efficient:

https://github.com/taichi-dev/taichi/blob/d9c961688f836518c292e6ab4c818c1f9644e7d3/taichi/backends/metal/shaders/helpers.metal.h#L53-L67

The BM case is to sum 65536 floats. Here's the results:

naive:
- Samples: [0.056786, 0.052332, 0.054000, 0.054563, 0.051821, 0.051795, 0.052055, 0.052787, 0.052526, 0.052136]
- Avg: 0.0531s
SIMD:
- Samples: [0.004288, 0.000835, 0.000750, 0.000751, 0.000752, 0.000724, 0.000706, 0.000692, 0.000704, 0.000726]
- Avg: 0.00109s

k-ye on 23 Jun 2020

👍1

Very cool!

There's no obvious perf difference when doing global reduction on integer atomics types.

One possibility is that the Metal compiler already does the optimization for you on integer, since integer operations are associative. Note that the compiler is not allowed to change float-point operation order unless you use --fastmath, so programmers have to specify SIMD reduction manually.

yuanming-hu on 23 Jun 2020

Was this page helpful?

0 / 5 - 0 ratings