Taichi: Slow initialization/compilation time

Created on 2 Mar 2020  路  12Comments  路  Source: taichi-dev/taichi

Describe the issue
As we add more and more stuff to Taichi, it seems that even a minimal taichi program now takes 10s to run. This harms user/developer experience. Also the tests can take ~30 minutes to finish on our build bots... :-(

To Reproduce
I added misc/minimal_timed.py as a benchmark:

import taichi as ti
import time

t = time.time()
ti.init(arch=ti.x64)

@ti.kernel
def p():
  print(42)

p()

print(f'{time.time() - t:.3f} s')

Log/Screenshots
The above program prints out 9.7s for x64 and 11.0s for cuda. When not compiled with CUDA, it may run slightly faster. Ideally this program should take < 0.3s.

welcome contribution

Most helpful comment

Update: Since now CI takes a ridiculously long time (40 minutes), I spent today optimizing the JIT compilation speed on LLVM backends. As pointed out by @k-ye the global optimization is the bottleneck. I looked into the LLVM IR and found that most time has been wasted on optimizing/compiling functions that we don't even use. So I added a few passes to rip the useless functions off before sending everything to LLVM.

This improves LLVM JIT compilation speed by a factor of ~10 and now ti test runs 6x faster on x64 backends. I also tested CUDA (via LLVM NVPTX) and the speedup is roughly the same.

Finally I think we can close this now.

All 12 comments

For X64, I can do some simple profiling to pinpoint which block is the bottleneck..

Good point, I found startup time is becoming more and more long during these days' develop...

Good point, I found startup time is becoming more and more long during these days' develop...

Yeah, sadly it's a gradual increase instead of a sudden one, according to Travis/AppVeyor. This means there could be many sources of the slowness...

Running the given kernel on my side took a total of about 1.10s (TL;DR; Buy a more powerful PC...) A (likely not so accurate) profiling showed these to be the hotspots during kernel compilation:


  1. https://github.com/taichi-dev/taichi/blob/9ebe657831408627a80e8e6afc541077e73e9d53/taichi/jit/jit_arch_cpu.cpp#L269

This took a total of about 0.75s (68.2%). It's been invoked twice, once for compiling the structs, the other for the actual kernel. Both took about 0.37s.

I tried to tune down this to -O1, which helped reduce the time to ~0.31s for each call. But this remains a bottleneck.

Not sure if it could be improved further by compiling these two together. I.e. is the time coming from optimizing LLVM IR itself, or some hidden overhead running this optimization?


  1. https://github.com/taichi-dev/taichi/blob/9ebe657831408627a80e8e6afc541077e73e9d53/taichi/llvm/llvm_context.cpp#L134-L145

This took about 0.26s (23.8%).

Hmmm, may be a good time to turn JIT into AOT 馃槀

Switched to another PC.

Compiled with TI_WITH_CUDA=ON

x64:  4.577 s
cuda: 5.206 s

(note that in this case even for x64 the CUDA context will be created.)

Compiled with TI_WITH_CUDA=OFF

x64: 0.790 s

It seems to me that CUDA context initialization takes quite a few seconds, so I removed CUDA context initialization when using CPUs. Now at least the CPU version finishes within 0.8s.

Thanks for profiling the x64 backend! That helped we understand the time consumption much better :-)

I tried to tune down this to -O1, which helped reduce the time to ~0.31s for each call. But this remains a bottleneck.

Not sure if it could be improved further by compiling these two together. I.e. is the time coming from optimizing LLVM IR itself, or some hidden overhead running this optimization?

I guess the time is just spent on LLVM IR optimization. Unfortunately, I can't think of an easy way to effectively reduce this part :-(

"runtime module bitcode compilation" will only happen in development mode, once per Taichi instance, so I wouldn't worry too much about that.

Btw, can you test this against different version? Recently even launching snode accessors kernel take seconds to complete each time... after I merged master into ast. What's new there?

I'm not sure what version you're talking about or any repro code. Maybe you can figure this out by adding a few get_current_program().profiler_start(), [...].profiler_stop() around snode accessors.

This interested me! Want to take a look at this issue. Is this the ti.profiler_start() in python-scope? Seems not showing anything.

Yep, ti.profiler_start() is exposed to Python:

https://github.com/taichi-dev/taichi/blob/1dbd05a0b17d39193f51220be648673fd65be6d7/taichi/python/export_lang.cpp#L116

However, If you need to profile a sub section of CPP's function, then I believe you still need to use Program::profiler_start() in the CPP scope.

Update: Since now CI takes a ridiculously long time (40 minutes), I spent today optimizing the JIT compilation speed on LLVM backends. As pointed out by @k-ye the global optimization is the bottleneck. I looked into the LLVM IR and found that most time has been wasted on optimizing/compiling functions that we don't even use. So I added a few passes to rip the useless functions off before sending everything to LLVM.

This improves LLVM JIT compilation speed by a factor of ~10 and now ti test runs 6x faster on x64 backends. I also tested CUDA (via LLVM NVPTX) and the speedup is roughly the same.

Finally I think we can close this now.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

yuanming-hu picture yuanming-hu  路  3Comments

jackalcooper picture jackalcooper  路  4Comments

GeoffreyPlitt picture GeoffreyPlitt  路  4Comments

yuanming-hu picture yuanming-hu  路  3Comments

liaopeiyuan picture liaopeiyuan  路  3Comments