Taichi: [CUDA] [Perf] The performance of cornell_box.py on CUDA is still worse than on OpenGL

Created on 30 Jul 2020 · 3 Comments · Source: taichi-dev/taichi

Describe the bug
CUDA still seems to run slower than OpenGL in cornell_box.py.
This is probably not related to ti.random(); otherwise #1419 should already have fixed it.

To Reproduce
Examples where CUDA is faster than OpenGL:
Run examples/stable_fluid.py: 38 fps on CUDA, and 35 fps on NVIDIA OpenGL
Run examples/sdf_renderer.py: 8.6 sps on CUDA, and 4.9 sps on NVIDIA OpenGL

Examples where OpenGL is faster than CUDA:
Run examples/cornell_box.py: 13.7 sps on CUDA, and 33.4 sps on NVIDIA OpenGL
Run examples/mpm99.py with quality = 4: 2.2 fps on CUDA, and 2.6 fps on NVIDIA OpenGL
Run examples/mpm3d.py in #1639: 12 fps on CUDA, and 20 fps on both NVIDIA OpenGL and Mesa OpenGL
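
To make the gaps easier to compare across examples, the figures listed above can be turned into CUDA/OpenGL ratios. This is only a small sketch built from the numbers quoted in this report; the dictionary and function names are illustrative and not part of Taichi.

```python
# Throughput reported above as (CUDA, NVIDIA OpenGL).
# Units are fps or samples/s depending on the example.
reported = {
    "stable_fluid.py": (38.0, 35.0),
    "sdf_renderer.py": (8.6, 4.9),
    "cornell_box.py": (13.7, 33.4),
    "mpm99.py (quality = 4)": (2.2, 2.6),
    "mpm3d.py": (12.0, 20.0),
}


def cuda_over_opengl(cuda, opengl):
    """Return CUDA throughput relative to OpenGL (> 1 means CUDA is faster)."""
    return cuda / opengl


ratios = {name: cuda_over_opengl(*pair) for name, pair in reported.items()}

# Print the slowest CUDA case first.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    faster = "CUDA" if ratio > 1 else "OpenGL"
    print(f"{name:24s} CUDA/OpenGL = {ratio:.2f}  ({faster} faster)")
```

By this measure cornell_box.py is the largest gap: CUDA reaches roughly 0.41 of the OpenGL throughput.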

It seems to me that examples/mpm3d.py is memory-bound, since NVIDIA OpenGL and Mesa OpenGL both get the same FPS. So I guess there might be some issue in CUDA's memory caching mechanism that lowers the peak memory bandwidth?
However, IMO examples/stable_fluid.py is also very memory-bound, so why is CUDA no worse than OpenGL there?

Log/Screenshots

(gui-lines) [bate@archit taichi]$ p examples/cornell_box.py 
[Taichi] mode=development
[Taichi] preparing sandbox at /tmp/taichi-zouem0bw
[Taichi] <dev mode>, llvm 10.0.0, commit 9441aa59, python 3.8.3
[Taichi] Starting on arch=cuda
1.00 samples/s (10 iters, var=0.17095622420310974)
13.75 samples/s (20 iters, var=0.16237077116966248)
13.75 samples/s (30 iters, var=0.15682613849639893)
13.75 samples/s (40 iters, var=0.15285052359104156)
13.74 samples/s (50 iters, var=0.1497148871421814)
13.78 samples/s (60 iters, var=0.14720608294010162)
13.78 samples/s (70 iters, var=0.14517459273338318)
13.76 samples/s (80 iters, var=0.14345137774944305)
13.78 samples/s (90 iters, var=0.14200721681118011)
13.79 samples/s (100 iters, var=0.14074094593524933)
13.77 samples/s (110 iters, var=0.1396736055612564)
13.74 samples/s (120 iters, var=0.13870856165885925)
^CTraceback (most recent call last):
  File "examples/cornell_box.py", line 454, in <module>
    img = color_buffer.to_numpy() * (1 / (i + 1))
  File "/home/bate/Develop/taichi/python/taichi/lang/util.py", line 199, in wrapped
    return func(*args, **kwargs)
  File "/home/bate/Develop/taichi/python/taichi/lang/matrix.py", line 678, in to_numpy
    matrix_to_ext_arr(self, ret, as_vector)
  File "/home/bate/Develop/taichi/python/taichi/lang/kernel.py", line 541, in wrapped
    return primal(*args, **kwargs)
  File "/home/bate/Develop/taichi/python/taichi/lang/kernel.py", line 470, in __call__
    return self.compiled_functions[key](*args)
  File "/home/bate/Develop/taichi/python/taichi/lang/kernel.py", line 433, in func__
    t_kernel()
KeyboardInterrupt

(gui-lines) [bate@archit taichi]$ aop examples/cornell_box.py 
[Taichi] mode=development
[Taichi] preparing sandbox at /tmp/taichi-x_dcopaf
[Taichi] <dev mode>, llvm 10.0.0, commit 9441aa59, python 3.8.3
[I 07/30/20 09:06:22.175] [__init__.py:init@200] Following TI_ARCH setting up for arch=opengl
[Taichi] Starting on arch=opengl
1.23 samples/s (10 iters, var=0.17092570662498474)
32.99 samples/s (20 iters, var=0.1623740792274475)
33.03 samples/s (30 iters, var=0.15684311091899872)
33.24 samples/s (40 iters, var=0.15290534496307373)
33.43 samples/s (50 iters, var=0.1498454213142395)
33.53 samples/s (60 iters, var=0.14739716053009033)
33.37 samples/s (70 iters, var=0.1453833431005478)
33.37 samples/s (80 iters, var=0.14364846050739288)
33.33 samples/s (90 iters, var=0.14219777286052704)
33.39 samples/s (100 iters, var=0.1409151256084442)
33.44 samples/s (110 iters, var=0.13981790840625763)
33.35 samples/s (120 iters, var=0.1388426572084427)
33.44 samples/s (130 iters, var=0.13798461854457855)
33.41 samples/s (140 iters, var=0.13721933960914612)
33.42 samples/s (150 iters, var=0.1365472376346588)
33.52 samples/s (160 iters, var=0.13591784238815308)
33.32 samples/s (170 iters, var=0.13537169992923737)
33.34 samples/s (180 iters, var=0.13485784828662872)
33.59 samples/s (190 iters, var=0.13438907265663147)
(gui-lines) [bate@archit taichi]$ p examples/sdf_renderer.py   
[Taichi] mode=development
[Taichi] preparing sandbox at /tmp/taichi-s3lxy9zp
[Taichi] <dev mode>, llvm 10.0.0, commit 9441aa59, python 3.8.3
[Taichi] Starting on arch=cuda
0.00 samples/s
7.74 samples/s
8.55 samples/s
8.60 samples/s
8.66 samples/s
8.65 samples/s
8.66 samples/s
^C^CTraceback (most recent call last):
  File "examples/sdf_renderer.py", line 160, in <module>
    img = color_buffer.to_numpy() * (1 / (i + 1))
  File "/home/bate/Develop/taichi/python/taichi/lang/util.py", line 199, in wrapped
    return func(*args, **kwargs)
  File "/home/bate/Develop/taichi/python/taichi/lang/matrix.py", line 678, in to_numpy
    matrix_to_ext_arr(self, ret, as_vector)
  File "/home/bate/Develop/taichi/python/taichi/lang/kernel.py", line 541, in wrapped
    return primal(*args, **kwargs)
  File "/home/bate/Develop/taichi/python/taichi/lang/kernel.py", line 470, in __call__
    return self.compiled_functions[key](*args)
  File "/home/bate/Develop/taichi/python/taichi/lang/kernel.py", line 433, in func__
    t_kernel()
KeyboardInterrupt

(gui-lines) [bate@archit taichi]$ aop examples/sdf_renderer.py 
[Taichi] mode=development
[Taichi] preparing sandbox at /tmp/taichi-wsbkfd9t
[Taichi] <dev mode>, llvm 10.0.0, commit 9441aa59, python 3.8.3
[I 07/30/20 09:10:53.170] [__init__.py:init@200] Following TI_ARCH setting up for arch=opengl
[Taichi] Starting on arch=opengl
0.00 samples/s
4.35 samples/s
4.88 samples/s
4.91 samples/s
4.91 samples/s
4.91 samples/s
^C[ 4454.497841] [WARN]Received Interrupt signal.
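
The steady-state figures quoted in "To Reproduce" can be recovered from logs like the ones above by averaging the samples/s readings after the first one, which is much lower than the rest (presumably because it includes kernel compilation). A minimal sketch, assuming the log format shown above; `steady_state_rate` and its `skip` parameter are hypothetical helpers, not Taichi functions.

```python
import re
from statistics import mean

# Matches lines such as "13.75 samples/s (20 iters, var=...)" or "8.60 samples/s".
RATE_LINE = re.compile(r"^\s*(\d+(?:\.\d+)?) samples/s")


def steady_state_rate(log_text, skip=1):
    """Average the samples/s readings in a log, ignoring the first `skip` readings."""
    rates = []
    for line in log_text.splitlines():
        match = RATE_LINE.match(line)
        if match:
            rates.append(float(match.group(1)))
    steady = rates[skip:]
    if not steady:
        raise ValueError("no steady-state readings found")
    return mean(steady)


# First lines of the CUDA cornell_box.py log above.
excerpt = """\
[Taichi] Starting on arch=cuda
1.00 samples/s (10 iters, var=0.17095622420310974)
13.75 samples/s (20 iters, var=0.16237077116966248)
13.75 samples/s (30 iters, var=0.15682613849639893)
13.75 samples/s (40 iters, var=0.15285052359104156)
"""
print(f"{steady_state_rate(excerpt):.2f} samples/s")
```

For the sdf_renderer.py logs, where the first two readings are still ramping up, a larger `skip` would be the natural choice.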

If you have local commits (e.g. compile fixes before you reproduce the bug), please make sure you first make a PR to fix the build errors and then report the bug.
Also note that kernel_profiler currently doesn't work well on OpenGL due to its lack of support for ti.sync().
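
Where the kernel profiler is not usable, wall-clock throughput is the fallback, which is what the samples/s figures in the logs above report. The sketch below shows the general pattern only: `step` and `sync` are hypothetical callables supplied by the caller, where `sync` stands for any operation that blocks until queued GPU work has finished. In the examples above a blocking read-back such as `color_buffer.to_numpy()` presumably plays that role.

```python
import time


def measure_rate(step, sync, iters=10):
    """Return iterations per second of `step`, timed between two blocking syncs.

    `step` and `sync` are placeholders supplied by the caller (assumptions,
    not Taichi APIs): `step` launches one iteration of work, `sync` blocks
    until all previously launched work has completed.
    """
    sync()  # drain work queued before the measurement starts
    start = time.perf_counter()
    for _ in range(iters):
        step()
    sync()  # make sure the timed work has actually finished
    elapsed = time.perf_counter() - start
    return iters / elapsed


# Usage with stand-in callables; a real run would pass a kernel launch
# and a backend-specific blocking call instead.
calls = []
rate = measure_rate(lambda: calls.append(1), lambda: None, iters=5)
print(f"{rate:.1f} iterations/s over {len(calls)} calls")
```

Without the second sync, an asynchronous backend could return from `step` before the work is done, which would inflate the measured rate.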

potential bug


archibate

All 3 comments

As the custodian of the Taichi developer community culture, I'm closing this issue given that the description is very rude. Please keep in mind that no one will be willing to help on an issue or review a PR if the author doesn't speak politely. Specifically, please keep in mind:

No one in the community is obligated to contribute. So please ask nicely if you want people to help implement/review anything.
No need to be rude. Refrain from using "Why?? I'm sure I didn't blind my eye?".
There's no need to use double question marks in any case. That's extremely aggressive.
Be polite and positive.

At a high level, communicating politely and effectively is as important as getting things done.

Feel free to open another issue with the wording issues addressed.

yuanming-hu on 9 Aug 2020


@archibate Thanks. It looks much better now. Keep in mind that being polite and friendly makes other people more willing to respond to you.

I'll take a look at this soon.

yuanming-hu on 9 Aug 2020


I took a quick look, and there doesn't seem to be a clear indication of why CUDA is slower than OpenGL on cornell_box. Given that there are cases where CUDA is faster and cases where OpenGL is faster, this doesn't mean something is systematically wrong with CUDA. We should revisit this later.

yuanming-hu on 22 Aug 2020
