Taichi: [Perf] Comparing Taichi with Numba

Created on 10 Jun 2020 · 3 Comments · Source: taichi-dev/taichi

A Zhihu user's comment at https://zhuanlan.zhihu.com/p/145222094 shows that Numba is much faster than Taichi on GPUs for my calc_pi example. I have also heard that Numba supports CUDA to some degree.
I'm not sure whether this also applies to other applications. If we can reproduce this slowdown on other examples, that is a warning that we may lose users to Numba for Python-embedded parallel computation.
Lacking Numba knowledge, I failed to write a Numba version of simple_uv.py.
Here's a NumPy-only version:

import taichi as ti
import numpy as np

res = 1280, 720


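def paint():
    # Per-axis coordinates in [0, 1]: res[1] samples and res[0] samples
    a = np.linspace(0, 1, res[1])
    b = np.linspace(0, 1, res[0])
    # With the default 'xy' indexing, both grids have shape res = (1280, 720)
    a, b = np.meshgrid(a, b)
    c = np.zeros((*res, 1))  # third channel is all zeros
    a = a.reshape((*res, 1))
    b = b.reshape((*res, 1))
    # Stack the three channels into an image of shape (*res, 3)
    return np.concatenate((a, b, c), axis=2)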


gui = ti.GUI('UV', res)
while not gui.get_event(ti.GUI.ESCAPE):
    pixels = paint()
    gui.set_image(pixels)
    gui.show()

This gets ~32 fps on my machine, while Taichi/x64 gets ~51 fps and Taichi/OpenGL gets ~34 fps (because of copying overhead).
For a Numba parallelization example, see https://github.com/numba/numba/issues/3336.
For the Numba docs, see http://numba.pydata.org/numba-doc/latest.
An article about Numba: https://www.jianshu.com/p/69d9d7e37bc5.
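
For anyone who wants to try the comparison, a Numba port of paint() might look roughly like the sketch below. It has not been benchmarked, so no fps figure is claimed. It fills the same (1280, 720, 3) array as the NumPy version with an explicit loop parallelized over rows using prange. The paint(width, height) signature and the no-op fallback for machines without Numba are assumptions added for illustration, not part of simple_uv.py.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Assumption for illustration: if Numba is not installed, fall back to
    # plain Python so the sketch still runs (slowly, and without parallelism).
    def njit(*args, **kwargs):
        return lambda fn: fn
    prange = range

res = 1280, 720


@njit(parallel=True)
def paint(width, height):
    # Same layout as the NumPy version: channel 0 varies along the second
    # axis, channel 1 along the first axis, channel 2 stays zero.
    pixels = np.zeros((width, height, 3))
    for i in prange(width):  # rows are independent, so they can run in parallel
        for j in range(height):
            pixels[i, j, 0] = j / (height - 1)
            pixels[i, j, 1] = i / (width - 1)
    return pixels


pixels = paint(res[0], res[1])
```

The resulting array can be passed to gui.set_image(pixels) in the same loop as above. The first call includes JIT compilation time, so any fps measurement should start after a warm-up call.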

All 3 comments

I am also curious about the advantages of Taichi compared to Numba.
As far as I know, Taichi supports sparse computation, which is common and useful in simulation. But for now, most examples don't seem to use sparse computation.

I am not too surprised by this. The calc_pi example does a lot of atomic adds, which are really slow in Taichi right now. Run the profiler on the mgpcg example and you'll see the reductions taking a very long time. We need some thread-local/shared-memory optimizations to make these faster.

Yeah, I wouldn't worry too much about that - we are adding Thread Local Storage IR to address the reduction performance issue very soon.

Also, TBH, we haven't done a systematic performance study after switching to LLVM - there's a lot of room for performance improvements...
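
To make the reduction point concrete, here is a plain-Python sketch of the two patterns discussed above. It is not Taichi's implementation: a lock-protected shared accumulator stands in for an atomic add, and per-worker partial sums stand in for thread-local storage. The calc_pi kernel is not reproduced in this thread, so the midpoint-rule integral of 4 / (1 + x^2) used here is an assumption, as are the names N, WORKERS, term, pi_shared and pi_local.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

N = 20_000    # number of integration samples (hypothetical value)
WORKERS = 4   # number of worker threads (hypothetical value)


def term(i):
    # Midpoint-rule sample of the integral of 4 / (1 + x^2) over [0, 1]
    x = (i + 0.5) / N
    return 4.0 / (1.0 + x * x) / N


def pi_shared():
    # Every addition goes through one shared accumulator, so all workers
    # contend for the same lock (analogous to contended atomic adds).
    total = [0.0]
    lock = threading.Lock()

    def work(w):
        for i in range(w, N, WORKERS):
            with lock:
                total[0] += term(i)

    with ThreadPoolExecutor(WORKERS) as pool:
        list(pool.map(work, range(WORKERS)))
    return total[0]


def pi_local():
    # Each worker accumulates privately; only WORKERS values are combined.
    def work(w):
        local = 0.0
        for i in range(w, N, WORKERS):
            local += term(i)
        return local

    with ThreadPoolExecutor(WORKERS) as pool:
        return sum(pool.map(work, range(WORKERS)))
```

Both functions compute the same estimate; the difference is that pi_shared synchronizes N times while pi_local synchronizes only when the per-worker results are combined. Python threads will not show a GPU-like speedup here, so the sketch illustrates the access pattern only.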
