Taichi: Starting multiple Taichi instances causes bus error under development mode

Created on 19 Feb 2020 · 15 Comments · Source: taichi-dev/taichi

Describe the bug
Starting multiple Taichi instances simultaneously causes Fatal Python error: Bus error
For example, see https://github.com/taichi-dev/taichi/issues/481#issuecomment-586720382

To Reproduce
This almost always happens when we use multithreaded testing with >= 4 threads. (Clearly, the more threads, the higher the probability of a crash.)

Cause
In development mode, taichi copies build/libtaichi_core.so to build/taichi_core.so. This ensures that writing to build/libtaichi_core.so (i.e., when you are compiling Taichi itself) does not crash any running taichi instances, which depend on build/taichi_core.so.

However, when two taichi instances start at the same time, they can race with each other. Specifically, instance A may be importing build/taichi_core.so while instance B is removing the current build/taichi_core.so and creating its own copy. The shared object that instance A is loading gets deleted, which results in a bus error.

How to fix
Create a folder for each process, named using the process id, the current time, a random number, etc., so that each taichi instance has its own sandbox for build/taichi_core.so. The code to modify is here:

https://github.com/taichi-dev/taichi/blob/97dbf64f735598e64dc13690e5f237dedf20f091/python/taichi/core/util.py#L178

Actually, this has been a known issue for a long time, but I totally forgot about it:
https://github.com/taichi-dev/taichi/blob/97dbf64f735598e64dc13690e5f237dedf20f091/python/taichi/core/util.py#L204

Labels: bug, welcome contribution

All 15 comments

We can copy that to /tmp/v1SOXPB/taichi_core.so. To do this, you want to import tempfile.
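A minimal sketch of that idea, assuming a hypothetical helper named prepare_sandbox (the name, signature, and "taichi-" prefix are illustrative, not Taichi's actual code). tempfile.mkdtemp creates a new, uniquely named directory on every call, so two instances never share a copy:

```python
import os
import shutil
import tempfile


def prepare_sandbox(src_path, prefix="taichi-"):
    """Copy src_path into a fresh per-process directory.

    Hypothetical helper for illustration only.
    Returns (sandbox_dir, path_of_the_copy).
    """
    # mkdtemp always creates a new directory with a unique name,
    # so concurrent callers cannot end up in the same sandbox.
    sandbox_dir = tempfile.mkdtemp(prefix=prefix)
    dst_path = os.path.join(sandbox_dir, "taichi_core.so")
    shutil.copy(src_path, dst_path)
    return sandbox_dir, dst_path
```

Each instance would then import the copy from its own sandbox directory, so no instance can delete a file that another one is loading.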

Alternate solution 1:
Instead of making another taichi_core.so copy, we add write-protection to libtaichi_core.so.

Alternate solution 2:
Instead of making a copy of taichi_core.so with identical contents for each instance, we make just one taichi_core.so and do not replace it until timestamp(libtaichi_core.so) > timestamp(taichi_core.so). This also helps taichi start up more quickly.

Using file timestamps is a great idea to avoid unnecessary copies! We will need some file locking for safety, though.
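The timestamp check plus locking could look roughly like the sketch below. It assumes a POSIX system (fcntl is not available on Windows), and refresh_copy and lock_path are hypothetical names introduced here for illustration:

```python
import fcntl  # POSIX only
import os
import shutil


def refresh_copy(src_path, dst_path, lock_path):
    """Copy src_path over dst_path only if src is newer.

    Hypothetical helper for illustration only.
    Returns True if a copy was made, False otherwise.
    """
    with open(lock_path, "w") as lock_file:
        # Exclusive lock: blocks until no other process holds it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if (os.path.exists(dst_path)
                    and os.path.getmtime(src_path) <= os.path.getmtime(dst_path)):
                return False  # existing copy is up to date
            # Write to a temporary name, then rename over the target,
            # so dst_path is never observed half-written.
            tmp_path = dst_path + ".tmp"
            shutil.copy(src_path, tmp_path)
            os.replace(tmp_path, dst_path)
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

The lock only serializes the processes that go through this function; it does not stop the build from writing libtaichi_core.so in the middle of a copy, so this is a sketch of the idea rather than a complete fix.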

I think solution 1 is better. Even better would be to remove this feature, since nobody wants to build while tests are running. If they do, the changes should go in the Makefiles instead of taichi. Also note that rewriting the same meaningless data again and again is not friendly to SSD users like me.

> Instead of making another taichi_core.so copy, we add write-protection to libtaichi_core.so.

Actually, this might hurt the developer experience. Adding write-protection to libtaichi_core.so means that you can't compile taichi while testing...

We can still compile, of course; we just can't link. While libtaichi_core.so is write-protected, every compile step would succeed except the final link step, by which time the tests may well have finished.
Also, they could use ninja -C build $(find CMakeFiles/taichi_core.dir).

Taichi can have multiple instances, while there can be only one build. So it's better and easier to do the protection work in the build instead of in taichi core.

A new problem arising from #501: too many /tmp/taichi-* directories are filling up my /tmp!!!
We want to use atexit.register(lambda: os.unlink(tmp_dir))!
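One caveat when sketching this: os.unlink removes files, not directories, so cleaning up a sandbox directory needs something like shutil.rmtree. A minimal sketch, assuming a hypothetical make_sandbox helper (not Taichi's actual code):

```python
import atexit
import os
import shutil
import tempfile


def make_sandbox(prefix="taichi-"):
    """Create a per-process temp directory that is removed at exit.

    Hypothetical helper for illustration only.
    """
    sandbox_dir = tempfile.mkdtemp(prefix=prefix)
    # shutil.rmtree deletes the directory and everything inside it;
    # ignore_errors=True keeps interpreter shutdown quiet if it is
    # already gone.
    atexit.register(shutil.rmtree, sandbox_dir, ignore_errors=True)
    return sandbox_dir
```

atexit handlers run only on normal interpreter exit. A process killed by a signal or terminated with os._exit leaves its directory behind, so stale /tmp/taichi-* directories can still accumulate after crashes.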

Maybe related. Running a Taichi application with mpirun fails most of the time since multiple ranks are trying to compile and load the same runtime file. Could we move this inside the sandbox as well?

running: mpirun -np 4 python3 laplace.py

leads to:

[taichi] prepared sandbox at /tmp/taichi-3odr7ws9
[taichi] prepared sandbox at /tmp/taichi-7c93liyx
[taichi] prepared sandbox at /tmp/taichi-bm011fe0
[taichi] prepared sandbox at /tmp/taichi-lcitw8x2
[Taichi version 0.5.2, cpu only, commit a8490052]
[Taichi version 0.5.2, cpu only, commit a8490052]
[Taichi version 0.5.2, cpu only, commit a8490052]
[Taichi version 0.5.2, cpu only, commit a8490052]
[W 02/22/20 08:46:11.554] [taichi_llvm_context.cpp:module_from_bitcode_file@170] Bitcode loading error message:
[E 02/22/20 08:46:11.554] [taichi_llvm_context.cpp:module_from_bitcode_file@172] Bitcode /home/klozes/Documents/software/taichi/taichi/runtime//runtime_x86_64.bc load failure.
[E 02/22/20 08:46:11.554] Received signal 6 (Aborted)
Invalid bitcode signature
***********************************
* Taichi Compiler Stack Traceback *
***********************************
/tmp/taichi-3odr7ws9/taichi_core.so: taichi::signal_handler(int)
/lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f367b02cf20]
/lib/x86_64-linux-gnu/libc.so.6: gsignal
/tmp/taichi-3odr7ws9/taichi_core.so: taichi::Tlang::module_from_bitcode_file(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::LLVMContext*)
/tmp/taichi-3odr7ws9/taichi_core.so: taichi::Tlang::TaichiLLVMContext::clone_runtime_module()
/tmp/taichi-3odr7ws9/taichi_core.so: taichi::Tlang::TaichiLLVMContext::get_init_module()
/tmp/taichi-3odr7ws9/taichi_core.so: taichi::Tlang::StructCompilerLLVM::StructCompilerLLVM(taichi::Tlang::Program*, taichi::Tlang::Arch)
/tmp/taichi-3odr7ws9/taichi_core.so: taichi::Tlang::StructCompiler::make(bool, taichi::Tlang::Program*, taichi::Tlang::Arch)
/tmp/taichi-3odr7ws9/taichi_core.so: taichi::Tlang::Program::materialize_layout()
/tmp/taichi-3odr7ws9/taichi_core.so: taichi::Tlang::layout(std::function<void ()> const&)
/tmp/taichi-3odr7ws9/taichi_core.so(+0x84db29) [0x7f3654fc9b29]
/tmp/taichi-3odr7ws9/taichi_core.so(+0x63e484) [0x7f3654dba484]
python3() [0x50abc5]
python3(_PyEval_EvalFrameDefault+0x449) [0x50c549]
python3() [0x5081d5]
python3() [0x50a020]
python3() [0x50aa1d]
python3(_PyEval_EvalFrameDefault+0x449) [0x50c549]
python3(_PyFunction_FastCallDict+0xf5) [0x5093e5]
python3() [0x5951c1]
python3(PyObject_Call+0x3e) [0x5a04ce]
python3() [0x557878]
python3() [0x541d40]
python3(_PyEval_EvalFrameDefault+0xed8) [0x50cfd8]
python3() [0x5081d5]
python3(PyEval_EvalCode+0x23) [0x50b3a3]
python3() [0x635082]
python3(PyRun_FileExFlags+0x97) [0x635137]
python3(PyRun_SimpleFileExFlags+0x17f) [0x6388ef]
python3(Py_Main+0x591) [0x639491]
python3(main+0xe0) [0x4b0f60]
/lib/x86_64-linux-gnu/libc.so.6: __libc_start_main
python3(_start+0x2a) [0x5b2eaa]

Yeah, I also find the bitcode compilation to be problematic when multiple instances start (under development mode only). We should consider moving this to the tmp dir as well. An easy solution is to pass in the tmpdir generated by Python via pybind11 (set_tmp_dir(std::string)) and save that value in CoreState. Then, when we compile the runtime bitcode, we just use that tmp_dir.

Working on this now. PR opened at #517.

Thanks archibate. Actually, I think source-to-source backends are also still going to run into problems since they emit code to temp files. These should also be moved to the temp dir. It should be easy to pass the temp dir path over to CodeGenBase.

Thanks for your suggestion! Yeah, all files generated at runtime should be put in the sandbox. For the OpenGL backend, though, that's not a problem: some backends invoke an external program to compile, while OpenGL uses an API called glShaderSource that takes the source (const char *src) directly as an argument, so no temp file is needed! But I don't know whether other backends use temp files. Any idea? @yuanming-hu

To chime in, the Metal backend also doesn't generate any temporary files. The solution is similar to the OpenGL backend -- Apple has this newLibraryWithSource:options:completionHandler: API https://developer.apple.com/documentation/metal/mtldevice/1433351-newlibrarywithsource()

I believe this is resolved, thanks to the great efforts by @archibate.

(Starting too many instances might still lead to out-of-memory issues on GPUs, which have no obvious solution.)
