Taichi: pip-installed Taichi crashes on Google colab kernels

Created on 30 Oct 2019 · 59 Comments · Source: taichi-dev/taichi

Opening an empty CPU-backed notebook at https://colab.research.google.com and running the following code leads to a crash:

!apt install clang-7
!apt install clang-format
!pip install taichi-nightly
import taichi as ti

x, y = ti.var(ti.f32), ti.var(ti.f32)

@ti.layout
def xy():
  ti.root.dense(ti.ij, 16).place(x, y)

@ti.kernel
def laplace():
  for i, j in x:
    if (i + j) % 3 == 0:
      y[i, j] = 4.0 * x[i, j] - x[i - 1, j] - x[i + 1, j] - x[i, j - 1] - x[i, j + 1]
    else:
      y[i, j] = 0.0

for i in range(10):
  x[i, i + 1] = 1.0

laplace()

for i in range(10):
  print(y[i, i + 1])






Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so:
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so:
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::Kernel::operator()()
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::Kernel::compile()
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::Program::compile(taichi::Tlang::Kernel&)
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::KernelCodeGen::compile(taichi::Tlang::Program&, taichi::Tlang::Kernel&)
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::CPUCodeGen::lower_cpp()
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::irpass::lower(taichi::Tlang::IRNode*)
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::LowerAST::visit(taichi::Tlang::Block*)
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so:
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so:
Oct 30, 2019, 3:47:15 PM | WARNING | /lib/x86_64-linux-gnu/libc.so.6: abort
Oct 30, 2019, 3:47:15 PM | WARNING | /lib/x86_64-linux-gnu/libc.so.6: gsignal
Oct 30, 2019, 3:47:15 PM | WARNING | /lib/x86_64-linux-gnu/libc.so.6:
Oct 30, 2019, 3:47:15 PM | WARNING | /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::signal_handler(int)
Oct 30, 2019, 3:47:15 PM | WARNING | ***************************
Oct 30, 2019, 3:47:15 PM | WARNING | * Taichi Core Stack Trace *
Oct 30, 2019, 3:47:15 PM | WARNING | ***************************
Oct 30, 2019, 3:47:15 PM | WARNING | [E 10/30/19 14:47:15.371] Received signal 6 (Aborted)
Oct 30, 2019, 3:47:15 PM | WARNING | [I 10/30/19 14:47:15.340] [base.cpp:generate_binary@125] Compilation time: 2889.9 ms
Oct 30, 2019, 3:47:12 PM | WARNING | [T 10/30/19 14:47:12.056] [logging.cpp:Logger@67] Taichi core started. Thread ID = 122

Can you please provide some insight into the possible root of the problem, if you have an idea off the top of your head?

stale welcome contribution

Most helpful comment

FINALLY!!!! I identified the problem!
Colab kernels have a libtcmalloc library installed and env variable LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 set.
Somehow it causes libstdc++ to use libunwind instead of libgcc_s for stack unwinding on exceptions. For some reason this causes an abort while unwinding complex calls.

Running
LD_PRELOAD= python t.py
where t.py is some Taichi program, works, even on GPU kernels.
I'm looking for a way to make it work inside colab cells as well.
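
The `LD_PRELOAD=` workaround described above can also be applied from a notebook cell by launching the script in a child process whose environment has `LD_PRELOAD` removed. The following is a minimal sketch, assuming the Taichi program lives in a separate file such as `t.py`; the helper name `run_without_preload` is hypothetical and is not part of Taichi or Colab.

```python
import os
import subprocess
import sys


def run_without_preload(args):
    """Run `args` in a child process with LD_PRELOAD removed from its environment."""
    env = dict(os.environ)
    # Dropping the variable keeps the dynamic loader from injecting
    # libtcmalloc (or any other preloaded library) into the child process.
    env.pop("LD_PRELOAD", None)
    return subprocess.run(args, env=env, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, universal_newlines=True)


if __name__ == "__main__":
    # Self-check: the child process should not see LD_PRELOAD at all.
    probe = [sys.executable, "-c",
             "import os; print(os.environ.get('LD_PRELOAD'))"]
    print(run_without_preload(probe).stdout.strip())
    # To run a Taichi script instead (t.py is an assumed file name):
    # print(run_without_preload([sys.executable, "t.py"]).stdout)
```

This only affects the child process. The notebook kernel itself was already started with the preloaded library, so Taichi code executed directly in a cell is not changed by this sketch.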

All 59 comments

Thanks for reporting this. Taichi crashes during the AST lowering process on Google Colab. The same script runs fine offline though. It might be related to the use of C++ exceptions during AST lowering, however I currently don't have a clear idea what's wrong...

In general, I think using Google colab for Taichi is a good idea. I'll dig deeper into this later.

More debug information:

lsb_release -a

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.3 LTS
Release:    18.04
Codename:   bionic

Update: I tested exception throwing and it works fine on colab. There may be some other reason.

I tried the 0.0.80 version; here is the error log:

[Release mode]
[T 11/04/19 10:03:46.767] [logging.cpp:Logger@67] Taichi core started. Thread ID = 154
[Taichi version 0.0.80, cpu only, commit 5ad67ce6]
[I 11/04/19 10:03:46.779] [taichi_llvm_context.cpp:TaichiLLVMContext@59] Creating llvm context for arch: x86_64
Materializing layout...
[I 11/04/19 10:03:46.832] [codegen_llvm_x86.cpp:global_optimize_module_x86_64@93] Global optimization time: 40.946 ms
[I 11/04/19 10:03:46.834] [struct_llvm.cpp:operator()@277] Allocating data structure of size 2048
Initializing runtime with 4 elements
Runtime initialized.
[E 11/04/19 10:03:46.845] Received signal 6 (Aborted)


***********************************
* Taichi Compiler Stack Traceback *
***********************************

/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::signal_handler(int)
/lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f27fde1ef20]
/lib/x86_64-linux-gnu/libc.so.6: gsignal
/lib/x86_64-linux-gnu/libc.so.6: abort
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so(+0x1f77728) [0x7f27f28eb728]
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::LowerAST::visit(taichi::Tlang::FrontendAssignStmt*)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::LowerAST::visit(taichi::Tlang::Block*)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::irpass::lower(taichi::Tlang::IRNode*)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::CPUCodeGen::lower_llvm()
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::CPUCodeGen::lower()
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::KernelCodeGen::compile(taichi::Tlang::Program&, taichi::Tlang::Kernel&)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::Program::compile(taichi::Tlang::Kernel&)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::Kernel::compile()
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::Kernel::operator()()
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so(+0x9239cd) [0x7f27f12979cd]
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so(+0x7801ed) [0x7f27f10f41ed]
.........

Here is the notebook where I try to install or build Taichi in colab kernel.

Also, GPU version crashes for a different reason:

[Release mode]
Using CUDA Device [0]: Tesla K80
Device Compute Capability: 3.7
[T 11/04/19 16:17:15.299] [logging.cpp:Logger@67] Taichi core started. Thread ID = 176
[Taichi version 0.0.81, cuda 10.0, commit 54751054]
[E 11/04/19 16:17:15.317] [unified_allocator.cpp:UnifiedAllocator@24] GPU memory allocation failed.
[E 11/04/19 16:17:15.317] Received signal 6 (Aborted)
***********************************
* Taichi Compiler Stack Traceback *
***********************************
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::signal_handler(int)
/lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f0f30e80f20]
/lib/x86_64-linux-gnu/libc.so.6: gsignal
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::UnifiedAllocator::UnifiedAllocator(unsigned long, bool)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::UnifiedAllocator::create()
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::Program::Program(taichi::Tlang::Arch)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so(+0x9680f9) [0x7f0f241bc0f9]
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so(+0x7c92dd) [0x7f0f2401d2dd]
...

Thanks for testing! I'll try to take a deeper look into this later today.

Hi! Do you have any update on this? I'm now trying to build llvm and taichi on colab, but it takes a while...

Hi @znah,

Sorry I haven't got a chance to work on this. I think colab is a great place for using Taichi, however, it's also very hard to debug what's wrong...

A month ago, the crash happened during Taichi IR compilation. I couldn't reproduce this on any other environment.

If you could help investigate what's wrong, that would be great! It's also worth checking if the latest python wheels of Taichi still crash. You know, I'm somewhere on earth without access to Google.

Thanks,
Yuanming

The GPU crash is due to a virtual memory allocation issue. We should first make sure the CPU version works.

Here is the notebook where I try to build a dev version. I'm certainly doing something wrong, but I have a pre-built LLVM, so that we don't have to wait for it again.

So I basically reproduced the same error with Taichi built from source on colab. Where do we go from here?

Thanks for the notebook! It seems that I don't have permission to access it yet, so I requested access. Could you approve? If we can build from source on colab, I think one thing to try is a debug build (cmake .. -DCMAKE_BUILD_TYPE="Debug"), then run the script with python under gdb and see exactly which line causes the error... If gdb is not supported on colab (since gdb is interactive), maybe it's better to use printf...
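
Where an interactive gdb session is awkward, gdb can also be driven non-interactively. The following is a minimal sketch, assuming `gdb` is installed and the failing script is saved as `t.py` (an assumed file name); the helpers `gdb_backtrace_command` and `run_under_gdb` are hypothetical. It relies on gdb's `-batch`, `-ex`, and `--args` options to run the script and print a backtrace at the point where the process stops on the abort signal.

```python
import subprocess
import sys


def gdb_backtrace_command(script, python=sys.executable):
    """Build a gdb command line that runs `script` and prints a backtrace when it stops."""
    return [
        "gdb", "-batch",   # exit once the commands below have been executed
        "-ex", "run",      # start the program; gdb stops when SIGABRT arrives
        "-ex", "bt",       # print the call stack at the stop point
        "--args", python, script,
    ]


def run_under_gdb(script):
    """Launch gdb on `script`; requires gdb and, ideally, a debug build of Taichi."""
    return subprocess.run(gdb_backtrace_command(script))


if __name__ == "__main__":
    # Only print the command here; call run_under_gdb("t.py") to execute it.
    print(" ".join(gdb_backtrace_command("t.py")))
```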

Thanks, I have access now! It's late in my place, but let me try doing a debug build now before I go to sleep.

Thank you! I've actually started the debug build already. Waiting...

Oh, thanks! I'll continue working on this first thing tomorrow morning then. I hope the crashing reason is clear under the debug build. The notebook file you have shared is super useful! Let's see what will happen :-)

Thanks again. No need to rush, I just wanted to make sure Taichi works in colab someday.
Meanwhile I have remote gdb in colab :)

All I have so far:

(gdb) info line
Line 200 of "/content/taichi/taichi/transforms/lower_ast.cpp"
   starts at address 0x7fe6fdd1ad7c <taichi::Tlang::LowerAST::visit(taichi::Tlang::FrontendAssignStmt*)+348>
   and ends at 0x7fe6fdd1adb3 <taichi::Tlang::LowerAST::visit(taichi::Tlang::FrontendAssignStmt*)+403>.

(gdb) info args
this = 0x7ffed9e0bad8
assign = 0x3127fa0

(gdb) info local
expr = {
  expr = std::shared_ptr<taichi::Tlang::Expression> (use count 1767994415, weak count 795437154) = {get() = 0x30e9910}, const_value = false, atomic = false}
flattened = {stmts = std::vector of length 5, capacity 8 = {
    std::unique_ptr<taichi::Tlang::Stmt> = {get() = 0x7fe70b68db10}, 
    std::unique_ptr<taichi::Tlang::Stmt> = {get() = 0x7fe70b6b5510}, 
    std::unique_ptr<taichi::Tlang::Stmt> = {
      get() = 0x7fe70b6b89f0 <_rtld_global+2448>}, 
    std::unique_ptr<taichi::Tlang::Stmt> = {get() = 0x0}, 
    std::unique_ptr<taichi::Tlang::Stmt> = {get() = 0x7fe70b68db10}}}
(gdb) 

Thanks for the info! It might be due to shared pointer issues/memory corruption, but I need to dig deeper into this.

I'm making use of your notebook to build Taichi and diagnose. That's super helpful. Thank you for providing that.

It will also be helpful to have a stack backtrace when it crashes, i.e. bt in gdb, so that we know what exactly triggers the crash.
Is "remote gdb in colab" accessible to everyone or just Google people? :-)

The program crashes when the IRModified() exception is thrown.

Could the crash happen during stack unwinding (e.g. because some destructor is not virtual...)?
I see quite a few compiler warnings, by the way.

here is the stack:
```
(gdb) bt
#0  0x00007f464844e6c2 in __GI___waitpid (pid=2903,
    stat_loc=stat_loc@entry=0x7ffe2e7a0c08, options=options@entry=0)
    at ../sysdeps/unix/sysv/linux/waitpid.c:30
#1  0x00007f46483b9067 in do_system (line=)
    at ../sysdeps/posix/system.c:149
#2  0x00007f463aea9ac3 in taichi::signal_handler (signo=6)
    at /content/taichi/taichi/core/logging.cpp:134
#3
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#5  0x00007f46483aa801 in __GI_abort () at abort.c:79
#6  0x00007f463c7bf348 in _Unwind_Resume ()
    from /content/taichi/build/taichi_core.so
#7  0x00007f463b1d6244 in taichi::Tlang::LowerAST::visit (this=0x7ffe2e7a1ca8,
    assign=0x3e12840) at /content/taichi/taichi/transforms/lower_ast.cpp:209
#8  0x00007f463af9a7be in taichi::Tlang::FrontendAssignStmt::accept (
    this=0x3e12840, visitor=0x7ffe2e7a1ca8) at /content/taichi/taichi/ir.h:1567
#9  0x00007f463b1d4335 in taichi::Tlang::LowerAST::visit (this=0x7ffe2e7a1ca8,
    stmt_list=0x27dd170) at /content/taichi/taichi/transforms/lower_ast.cpp:26
#10 0x00007f463afa49ee in taichi::Tlang::Block::accept (this=0x27dd170,
    visitor=0x7ffe2e7a1ca8) at /content/taichi/taichi/ir.h:1455
#11 0x00007f463b1d420f in taichi::Tlang::LowerAST::run (node=0x27dd170)
    at /content/taichi/taichi/transforms/lower_ast.cpp:285
#12 0x00007f463b1d4075 in taichi::Tlang::irpass::lower (root=0x27dd170)
    at /content/taichi/taichi/transforms/lower_ast.cpp:298
#13 0x00007f463ae1af3d in taichi::Tlang::CPUCodeGen::lower_llvm (
    this=0x7ffe2e7a2a48) at /content/taichi/taichi/backends/codegen_x86.cpp:706
#14 0x00007f463ae1cccf in taichi::Tlang::CPUCodeGen::lower (
    this=0x7ffe2e7a2a48) at /content/taichi/taichi/backends/codegen_x86.cpp:825
#15 0x00007f463ae303a6 in taichi::Tlang::KernelCodeGen::compile (
    this=0x7ffe2e7a2a48, prog=..., kernel=...)
    at /content/taichi/taichi/backends/kernel.cpp:13
#16 0x00007f463afc3ff2 in taichi::Tlang::Program::compile (this=0x2b8c580,
    kernel=...) at /content/taichi/taichi/program.cpp:28
#17 0x00007f463afbd9b0 in taichi::Tlang::Kernel::compile (this=0x3afe000)
    at /content/taichi/taichi/kernel.cpp:37
#18 0x00007f463afbda2d in taichi::Tlang::Kernel::operator() (this=0x3afe000)
    at /content/taichi/taichi/kernel.cpp:43
#19 0x00007f463b164663 in taichi::Tlang::SNode::write_float (this=0x3579870,
    i=0, j=1, k=0, l=0, val=1) at /content/taichi/taichi/snode.cpp:136
#20 0x00007f463b1337de in pybind11::cpp_function::cpp_function(void (taichi::Tlang::SNode::)(int, int, int, int, double), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(taichi::Tlang::SNode, int, int, int, int, double)#1}::operator()(taichi::Tlang::SNode*, int, int, int, int, double) const (this=0x29b5538, c=0x3579870,
    args=1, args=1, args=1, args=1, args=1)
    at /usr/local/include/python3.6/pybind11/pybind11.h:78
#21 0x00007f463b13373a in pybind11::detail::argument_loader::call_impl(void (taichi::Tlang::SNode::)(int, int, int, int, double), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(taichi::Tlang::SNode, int, int, int, int, double)#1}&, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, pybind11::detail::void_type>(pybind11::cpp_function::cpp_function(void (taichi::Tlang::SNode::)(int, int, int, int, double), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(taichi::Tlang::SNode, int, int, int, int, double)#1}&, std::integer_sequence, pybind11::detail::void_type&&) (this=0x7ffe2e7a2ee8, f=...)
    at /usr/local/include/python3.6/pybind11/cast.h:1935
#22 0x00007f463b132ea6 in pybind11::detail::argument_loader::call(void (taichi::Tlang::SNode::)(int, int, int, int, double), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(taichi::Tlang::SNode, int, int, int, int, double)#1}&> (this=0x7ffe2e7a2ee8, f=...)
    at /usr/local/include/python3.6/pybind11/cast.h:1917
```

You can use gdb right in colab: just run the last cell, and it will give you a little prompt (with chars replaced by * :)

Thanks for the info x2. In the AST lowering pass, the transformer walks over the AST and modifies it, which might corrupt the call stack in some way. Then the program crashes during exception handling. I'll dig a bit more into it.

One possibility is that some node between the leaf node and the root (i.e. on the stack) gets deleted...
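
To make the discussion concrete: the thread describes a lowering pass that mutates the tree while walking it and signals each change by throwing `IRModified`. The Python sketch below illustrates that general pattern, restarting the walk after every mutation. It is an illustration built on assumed, simplified data structures, not Taichi's actual C++ implementation; every name other than `IRModified` is hypothetical.

```python
class IRModified(Exception):
    """Raised after the tree has been mutated, so the walk must restart."""


class Block:
    def __init__(self, stmts):
        self.stmts = list(stmts)


def lower_once(block):
    """Walk `block`; replace the first frontend statement found, then raise."""
    for i, stmt in enumerate(block.stmts):
        if isinstance(stmt, Block):
            lower_once(stmt)
        elif isinstance(stmt, str) and stmt.startswith("frontend:"):
            name = stmt[len("frontend:"):]
            # Mutating the list invalidates the ongoing iteration, so unwind
            # all the way back to the driver instead of continuing the walk.
            block.stmts[i:i + 1] = ["load " + name, "store " + name]
            raise IRModified()


def lower(root):
    """Driver: rerun the walk until one full pass finishes without changes."""
    passes = 0
    while True:
        passes += 1
        try:
            lower_once(root)
        except IRModified:
            continue
        return passes
```

In a C++ version of this pattern, the `raise` corresponds to a `throw` that unwinds through every nested `visit`/`accept` frame back to the driver, which matches where `_Unwind_Resume` appears in the backtrace above.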

I'm trying to debug by adding printf's here and there. You can edit files right in colab, but they need to have *.py extension :/ (so I copy .cpp as .py edit and copy back)

But my C++ debugging skills are quite rusty.

I used %%writefile to add some printfs this morning and located the crash at the point where IRModified is thrown. I also tried to prevent nodes on the stack from being deleted, yet that doesn't fix the problem...

We may try to use some clang instrumentation, like https://clang.llvm.org/docs/AddressSanitizer.html

Fun fact: building and running with AddressSanitizer makes the example work :/
(AddressSanitizer found a lot of leaks btw)

Oh no, now we have a Heisenbug... :-(
What does the leaks info look like? I'm happy to fix them if they are from taichi.
I tried compiling on OS X with -fsanitize=address yet the leak checker doesn't seem to work well.

"The leak detection is turned on by default on Linux, and can be enabled using ASAN_OPTIONS=detect_leaks=1 on macOS"

I'm getting AddressSanitizer: detect_leaks is not supported on this platform. and ERROR: Interceptors are not working. This may be because AddressSanitizer is loaded too late (e.g. via dlopen). Please launch the executable with: DYLD_INSERT_LIBRARIES=/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/10.0.0/lib/darwin/libclang_rt.asan_osx_dynamic.dylib, even with the env vars set according to what it asks me to do.

I guess debugging will be way easier after I'm back in Boston next Wed, with a working network and a Linux machine...

I'm running out of ideas. Will try to reach colab team to help me deploy the kernel image locally tomorrow.

Thanks for your help. Yesterday I tried AddressSanitizer and valgrind on my local machine and it seems that most human-readable leak records are from python itself and pybind11 initialization. I also tried compiling with clang-6 on my machine, yet the problem could not be reproduced...

I can finally reproduce the crash locally (in colab docker container)! Investigating...

Valgrind output around the crash:

==4223== Mismatched free() / delete / delete []
==4223==    at 0x4C30D3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==4223==    by 0x2B1CBA4F: __gnu_cxx::new_allocator<taichi::Tlang::TypedConstant>::deallocate(taichi::Tlang::TypedConstant*, unsigned long) (new_allocator.h:125)
==4223==    by 0x2B1CBA1F: std::allocator_traits<std::allocator<taichi::Tlang::TypedConstant> >::deallocate(std::allocator<taichi::Tlang::TypedConstant>&, taichi::Tlang::TypedConstant*, unsigned long) (alloc_traits.h:462)
==4223==    by 0x2B1CB42A: std::_Vector_base<taichi::Tlang::TypedConstant, std::allocator<taichi::Tlang::TypedConstant> >::_M_deallocate(taichi::Tlang::TypedConstant*, unsigned long) (stl_vector.h:180)
==4223==    by 0x2B1CBA8C: std::_Vector_base<taichi::Tlang::TypedConstant, std::allocator<taichi::Tlang::TypedConstant> >::~_Vector_base() (stl_vector.h:162)
==4223==    by 0x2B1CADE8: std::vector<taichi::Tlang::TypedConstant, std::allocator<taichi::Tlang::TypedConstant> >::~vector() (stl_vector.h:435)
==4223==    by 0x2B1CACB4: taichi::Tlang::LaneAttribute<taichi::Tlang::TypedConstant>::~LaneAttribute() (ir.h:344)
==4223==    by 0x2B1CA73A: std::_MakeUniq<taichi::Tlang::ConstStmt>::__single_object std::make_unique<taichi::Tlang::ConstStmt, taichi::Tlang::TypedConstant&>(taichi::Tlang::TypedConstant&) (unique_ptr.h:825)
==4223==    by 0x2B1CA69E: std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > taichi::Tlang::Stmt::make<taichi::Tlang::ConstStmt, taichi::Tlang::TypedConstant&>(taichi::Tlang::TypedConstant&) (ir.h:617)
==4223==    by 0x2B1CA5B9: taichi::Tlang::ConstExpression::flatten(taichi::Tlang::VecStatement&) (ir.h:2071)
==4223==    by 0x2B3EE546: taichi::Tlang::LowerAST::visit(taichi::Tlang::FrontendPrintStmt*) (lower_ast.cpp:97)
==4223==    by 0x2B1D49AD: taichi::Tlang::FrontendPrintStmt::accept(taichi::Tlang::IRVisitor*) (ir.h:1716)
==4223==  Address 0x87fc840 is 0 bytes inside a block of size 16 alloc'd
==4223==    at 0x4C3089F: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==4223==    by 0x2B1CB6E6: __gnu_cxx::new_allocator<taichi::Tlang::TypedConstant>::allocate(unsigned long, void const*) (new_allocator.h:111)
==4223==    by 0x2B1CB68B: std::allocator_traits<std::allocator<taichi::Tlang::TypedConstant> >::allocate(std::allocator<taichi::Tlang::TypedConstant>&, unsigned long) (alloc_traits.h:436)
==4223==    by 0x2B1CB342: std::_Vector_base<taichi::Tlang::TypedConstant, std::allocator<taichi::Tlang::TypedConstant> >::_M_allocate(unsigned long) (stl_vector.h:172)
==4223==    by 0x2B1CAF4D: std::vector<taichi::Tlang::TypedConstant, std::allocator<taichi::Tlang::TypedConstant> >::_M_default_append(unsigned long) (vector.tcc:571)
==4223==    by 0x2B1CAD50: std::vector<taichi::Tlang::TypedConstant, std::allocator<taichi::Tlang::TypedConstant> >::resize(unsigned long) (stl_vector.h:692)
==4223==    by 0x2B1CA8D6: taichi::Tlang::LaneAttribute<taichi::Tlang::TypedConstant>::LaneAttribute(taichi::Tlang::TypedConstant const&) (ir.h:354)
==4223==    by 0x2B1CA709: std::_MakeUniq<taichi::Tlang::ConstStmt>::__single_object std::make_unique<taichi::Tlang::ConstStmt, taichi::Tlang::TypedConstant&>(taichi::Tlang::TypedConstant&) (unique_ptr.h:825)
==4223==    by 0x2B1CA69E: std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > taichi::Tlang::Stmt::make<taichi::Tlang::ConstStmt, taichi::Tlang::TypedConstant&>(taichi::Tlang::TypedConstant&) (ir.h:617)
==4223==    by 0x2B1CA5B9: taichi::Tlang::ConstExpression::flatten(taichi::Tlang::VecStatement&) (ir.h:2071)
==4223==    by 0x2B3EE546: taichi::Tlang::LowerAST::visit(taichi::Tlang::FrontendPrintStmt*) (lower_ast.cpp:97)
==4223==    by 0x2B1D49AD: taichi::Tlang::FrontendPrintStmt::accept(taichi::Tlang::IRVisitor*) (ir.h:1716)
==4223== 
==4223== Mismatched free() / delete / delete []
==4223==    at 0x4C30D3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==4223==    by 0x2B04197F: __gnu_cxx::new_allocator<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > >::deallocate(std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >*, unsigned long) (new_allocator.h:125)
==4223==    by 0x2B04194F: std::allocator_traits<std::allocator<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > > >::deallocate(std::allocator<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > >&, std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >*, unsigned long) (alloc_traits.h:462)
==4223==    by 0x2B04107A: std::_Vector_base<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >, std::allocator<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > > >::_M_deallocate(std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >*, unsigned long) (stl_vector.h:180)
==4223==    by 0x2B1B8358: void std::vector<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >, std::allocator<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > > >::_M_realloc_insert<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > >(__gnu_cxx::__normal_iterator<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >*, std::vector<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >, std::allocator<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > > > >, std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >&&) (vector.tcc:448)
==4223==    by 0x2B1B803D: std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >& std::vector<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >, std::allocator<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > > >::emplace_back<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > >(std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >&&) (vector.tcc:105)
==4223==    by 0x2B1B13DF: std::vector<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >, std::allocator<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > > >::push_back(std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >&&) (stl_vector.h:954)
==4223==    by 0x2B3F4731: taichi::Tlang::PrintStmt* taichi::Tlang::VecStatement::push_back<taichi::Tlang::PrintStmt, taichi::Tlang::Stmt*&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&>(taichi::Tlang::Stmt*&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (ir.h:279)
==4223==    by 0x2B3EE585: taichi::Tlang::LowerAST::visit(taichi::Tlang::FrontendPrintStmt*) (lower_ast.cpp:98)
==4223==    by 0x2B1D49AD: taichi::Tlang::FrontendPrintStmt::accept(taichi::Tlang::IRVisitor*) (ir.h:1716)
==4223==    by 0x2B3ED71D: taichi::Tlang::LowerAST::visit(taichi::Tlang::Block*) (lower_ast.cpp:40)
==4223==    by 0x2B1C7BED: taichi::Tlang::Block::accept(taichi::Tlang::IRVisitor*) (ir.h:1458)
==4223==  Address 0x268a4f60 is 0 bytes inside a block of size 8 alloc'd
==4223==    at 0x4C3089F: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==4223==    by 0x2B0413F6: __gnu_cxx::new_allocator<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > >::allocate(unsigned long, void const*) (new_allocator.h:111)
==4223==    by 0x2B04139B: std::allocator_traits<std::allocator<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > > >::allocate(std::allocator<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > >&, unsigned long) (alloc_traits.h:436)
==4223==    by 0x2B040F92: std::_Vector_base<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >, std::allocator<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > > >::_M_allocate(unsigned long) (stl_vector.h:172)
==4223==    by 0x2B1B8119: void std::vector<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >, std::allocator<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > > >::_M_realloc_insert<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > >(__gnu_cxx::__normal_iterator<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >*, std::vector<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >, std::allocator<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > > > >, std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >&&) (vector.tcc:406)
==4223==    by 0x2B1B803D: std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >& std::vector<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >, std::allocator<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > > >::emplace_back<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > >(std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >&&) (vector.tcc:105)
==4223==    by 0x2B1B13DF: std::vector<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >, std::allocator<std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> > > >::push_back(std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >&&) (stl_vector.h:954)
==4223==    by 0x2B1B150C: taichi::Tlang::VecStatement::push_back(std::unique_ptr<taichi::Tlang::Stmt, std::default_delete<taichi::Tlang::Stmt> >&&) (ir.h:271)
==4223==    by 0x2B1CA5C6: taichi::Tlang::ConstExpression::flatten(taichi::Tlang::VecStatement&) (ir.h:2071)
==4223==    by 0x2B3EE546: taichi::Tlang::LowerAST::visit(taichi::Tlang::FrontendPrintStmt*) (lower_ast.cpp:97)
==4223==    by 0x2B1D49AD: taichi::Tlang::FrontendPrintStmt::accept(taichi::Tlang::IRVisitor*) (ir.h:1716)
==4223==    by 0x2B3ED71D: taichi::Tlang::LowerAST::visit(taichi::Tlang::Block*) (lower_ast.cpp:40)
==4223== 
==4223== Syscall param msync(start) points to uninitialised byte(s)
==4223==    at 0x51C7B59: msync (msync.c:25)
==4223==    by 0x62B42F3: ??? (in /usr/lib/x86_64-linux-gnu/libunwind.so.8.0.1)
==4223==    by 0x62B8230: ??? (in /usr/lib/x86_64-linux-gnu/libunwind.so.8.0.1)
==4223==    by 0x62B853E: ??? (in /usr/lib/x86_64-linux-gnu/libunwind.so.8.0.1)
==4223==    by 0x62B8A98: ??? (in /usr/lib/x86_64-linux-gnu/libunwind.so.8.0.1)
==4223==    by 0x62B4E70: _ULx86_64_step (in /usr/lib/x86_64-linux-gnu/libunwind.so.8.0.1)
==4223==    by 0x62B3940: _Unwind_RaiseException (in /usr/lib/x86_64-linux-gnu/libunwind.so.8.0.1)
==4223==    by 0x655DD16: __cxa_throw (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25)
==4223==    by 0x2B3EE5EB: taichi::Tlang::LowerAST::visit(taichi::Tlang::FrontendPrintStmt*) (lower_ast.cpp:101)
==4223==    by 0x2B1D49AD: taichi::Tlang::FrontendPrintStmt::accept(taichi::Tlang::IRVisitor*) (ir.h:1716)
==4223==    by 0x2B3ED71D: taichi::Tlang::LowerAST::visit(taichi::Tlang::Block*) (lower_ast.cpp:40)
==4223==    by 0x2B1C7BED: taichi::Tlang::Block::accept(taichi::Tlang::IRVisitor*) (ir.h:1458)
==4223==  Address 0x1ffeffd000 is on thread 1's stack
==4223==  in frame #6, created by _Unwind_RaiseException (???:)
==4223== 
==4223== Invalid read of size 8
==4223==    at 0x62B3EF6: _Ux86_64_setcontext (in /usr/lib/x86_64-linux-gnu/libunwind.so.8.0.1)
==4223==    by 0x2B1D49AD: taichi::Tlang::FrontendPrintStmt::accept(taichi::Tlang::IRVisitor*) (ir.h:1716)
==4223==    by 0x2B3ED71D: taichi::Tlang::LowerAST::visit(taichi::Tlang::Block*) (lower_ast.cpp:40)
==4223==    by 0x2B1C7BED: taichi::Tlang::Block::accept(taichi::Tlang::IRVisitor*) (ir.h:1458)
==4223==    by 0x2B3ED5CE: taichi::Tlang::LowerAST::run(taichi::Tlang::IRNode*) (lower_ast.cpp:310)
==4223==    by 0x2B3ED434: taichi::Tlang::irpass::lower(taichi::Tlang::IRNode*) (lower_ast.cpp:323)
==4223==    by 0x2B03132C: taichi::Tlang::CPUCodeGen::lower_llvm() (codegen_x86.cpp:706)
==4223==    by 0x2B0330CE: taichi::Tlang::CPUCodeGen::lower() (codegen_x86.cpp:827)
==4223==    by 0x2B046825: taichi::Tlang::KernelCodeGen::compile(taichi::Tlang::Program&, taichi::Tlang::Kernel&) (kernel.cpp:13)
==4223==    by 0x2B1DBC41: taichi::Tlang::Program::compile(taichi::Tlang::Kernel&) (program.cpp:29)
==4223==    by 0x2B1D55FF: taichi::Tlang::Kernel::compile() (kernel.cpp:37)
==4223==    by 0x2B1D567C: taichi::Tlang::Kernel::operator()() (kernel.cpp:43)
==4223==  Address 0x1ffeffcfa8 is on thread 1's stack
==4223==  1912 bytes below stack pointer
==4223== 
==4223== Invalid read of size 8
==4223==    at 0x62B3EFE: _Ux86_64_setcontext (in /usr/lib/x86_64-linux-gnu/libunwind.so.8.0.1)
==4223==    by 0x2B1D49AD: taichi::Tlang::FrontendPrintStmt::accept(taichi::Tlang::IRVisitor*) (ir.h:1716)
==4223==    by 0x2B3ED71D: taichi::Tlang::LowerAST::visit(taichi::Tlang::Block*) (lower_ast.cpp:40)
==4223==    by 0x2B1C7BED: taichi::Tlang::Block::accept(taichi::Tlang::IRVisitor*) (ir.h:1458)
==4223==    by 0x2B3ED5CE: taichi::Tlang::LowerAST::run(taichi::Tlang::IRNode*) (lower_ast.cpp:310)
==4223==    by 0x2B3ED434: taichi::Tlang::irpass::lower(taichi::Tlang::IRNode*) (lower_ast.cpp:323)
==4223==    by 0x2B03132C: taichi::Tlang::CPUCodeGen::lower_llvm() (codegen_x86.cpp:706)
==4223==    by 0x2B0330CE: taichi::Tlang::CPUCodeGen::lower() (codegen_x86.cpp:827)
==4223==    by 0x2B046825: taichi::Tlang::KernelCodeGen::compile(taichi::Tlang::Program&, taichi::Tlang::Kernel&) (kernel.cpp:13)
==4223==    by 0x2B1DBC41: taichi::Tlang::Program::compile(taichi::Tlang::Kernel&) (program.cpp:29)
==4223==    by 0x2B1D55FF: taichi::Tlang::Kernel::compile() (kernel.cpp:37)
==4223==    by 0x2B1D567C: taichi::Tlang::Kernel::operator()() (kernel.cpp:43)
==4223==  Address 0x1ffeffcf98 is on thread 1's stack
==4223==  1920 bytes below stack pointer
==4223== 
==4223== Invalid read of size 8
==4223==    at 0x62B3F05: _Ux86_64_setcontext (in /usr/lib/x86_64-linux-gnu/libunwind.so.8.0.1)
==4223==    by 0x2B1D49AD: taichi::Tlang::FrontendPrintStmt::accept(taichi::Tlang::IRVisitor*) (ir.h:1716)
==4223==    by 0x2B3ED71D: taichi::Tlang::LowerAST::visit(taichi::Tlang::Block*) (lower_ast.cpp:40)
==4223==    by 0x2B1C7BED: taichi::Tlang::Block::accept(taichi::Tlang::IRVisitor*) (ir.h:1458)
==4223==    by 0x2B3ED5CE: taichi::Tlang::LowerAST::run(taichi::Tlang::IRNode*) (lower_ast.cpp:310)
==4223==    by 0x2B3ED434: taichi::Tlang::irpass::lower(taichi::Tlang::IRNode*) (lower_ast.cpp:323)
==4223==    by 0x2B03132C: taichi::Tlang::CPUCodeGen::lower_llvm() (codegen_x86.cpp:706)
==4223==    by 0x2B0330CE: taichi::Tlang::CPUCodeGen::lower() (codegen_x86.cpp:827)
==4223==    by 0x2B046825: taichi::Tlang::KernelCodeGen::compile(taichi::Tlang::Program&, taichi::Tlang::Kernel&) (kernel.cpp:13)
==4223==    by 0x2B1DBC41: taichi::Tlang::Program::compile(taichi::Tlang::Kernel&) (program.cpp:29)
==4223==    by 0x2B1D55FF: taichi::Tlang::Kernel::compile() (kernel.cpp:37)
==4223==    by 0x2B1D567C: taichi::Tlang::Kernel::operator()() (kernel.cpp:43)
==4223==  Address 0x1ffeffcf68 is on thread 1's stack
==4223==  1968 bytes below stack pointer
==4223== 
==4223== Mismatched free() / delete / delete []
==4223==    at 0x4C30D3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==4223==    by 0x2B0C117D: taichi::signal_handler(int) (logging.cpp:127)
==4223==    by 0x50EAF1F: ??? (in /lib/x86_64-linux-gnu/libc-2.27.so)
==4223==    by 0x50EAE96: __libc_signal_restore_set (nptl-signals.h:80)
==4223==    by 0x50EAE96: raise (raise.c:48)
==4223==    by 0x50EC800: abort (abort.c:79)
==4223==    by 0x2C9D8F27: _Unwind_Resume (in /content/taichi/taichi/build/taichi_core.so)
==4223==    by 0x2B3EE639: taichi::Tlang::LowerAST::visit(taichi::Tlang::FrontendPrintStmt*) (lower_ast.cpp:95)
==4223==    by 0x2B1D49AD: taichi::Tlang::FrontendPrintStmt::accept(taichi::Tlang::IRVisitor*) (ir.h:1716)
==4223==    by 0x2B3ED71D: taichi::Tlang::LowerAST::visit(taichi::Tlang::Block*) (lower_ast.cpp:40)
==4223==    by 0x2B1C7BED: taichi::Tlang::Block::accept(taichi::Tlang::IRVisitor*) (ir.h:1458)
==4223==    by 0x2B3ED5CE: taichi::Tlang::LowerAST::run(taichi::Tlang::IRNode*) (lower_ast.cpp:310)
==4223==    by 0x2B3ED434: taichi::Tlang::irpass::lower(taichi::Tlang::IRNode*) (lower_ast.cpp:323)
==4223==  Address 0x27293900 is 0 bytes inside a block of size 28 alloc'd
==4223==    at 0x4C3089F: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==4223==    by 0x65F23AC: void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char const*>(char const*, char const*, std::forward_iterator_tag) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25)
==4223==    by 0x2AF8B5AC: fmt::BasicWriter<char>::str[abi:cxx11]() const (format.h:2601)
==4223==    by 0x2AF8B300: fmt::format[abi:cxx11](fmt::BasicCStringRef<char>, fmt::ArgList) (format.h:3295)
==4223==    by 0x2AF9B58D: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > fmt::format<int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(fmt::BasicCStringRef<char>, int const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (format.h:3586)
==4223==    by 0x2B0C1158: taichi::signal_handler(int) (logging.cpp:128)
==4223==    by 0x50EAF1F: ??? (in /lib/x86_64-linux-gnu/libc-2.27.so)
==4223==    by 0x50EAE96: __libc_signal_restore_set (nptl-signals.h:80)
==4223==    by 0x50EAE96: raise (raise.c:48)
==4223==    by 0x50EC800: abort (abort.c:79)
==4223==    by 0x2C9D8F27: _Unwind_Resume (in /content/taichi/taichi/build/taichi_core.so)
==4223==    by 0x2B3EE639: taichi::Tlang::LowerAST::visit(taichi::Tlang::FrontendPrintStmt*) (lower_ast.cpp:95)
==4223==    by 0x2B1D49AD: taichi::Tlang::FrontendPrintStmt::accept(taichi::Tlang::IRVisitor*) (ir.h:1716)
==4223== 

Thank you so much!! This is really helpful information. I see quite a few suspicious clues; let me dive deeper.

The first two warnings are really mysterious. For example,

==4223== Mismatched free() / delete / delete []
==4223==    at 0x4C30D3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==4223==    by 0x2B1CBA4F: __gnu_cxx::new_allocator<taichi::Tlang::TypedConstant>::deallocate(taichi::Tlang::TypedConstant*, unsigned long) (new_allocator.h:125)
...
==4223==  Address 0x87fc840 is 0 bytes inside a block of size 16 alloc'd
==4223==    at 0x4C3089F: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==4223==    by 0x2B1CB6E6: __gnu_cxx::new_allocator<taichi::Tlang::TypedConstant>::allocate(unsigned long, void const*) (new_allocator.h:111)
==4223==    by 0x2B1CB68B: std::allocator_traits<std::allocator<taichi::Tlang::TypedConstant> >::allocate(std::allocator<taichi::Tlang::TypedConstant>&, unsigned long) (alloc_traits.h:436)
==4223==    by 0x2B1CB342: std::_Vector_base<taichi::Tlang::TypedConstant, std::allocator<taichi::Tlang::TypedConstant> >::_M_allocate(unsigned long) (stl_vector.h:172)
...
==4223==    by 0x2B1D49AD: taichi::Tlang::FrontendPrintStmt::accept(taichi::Tlang::IRVisitor*) (ir.h:1716)
==4223== 

If I understand correctly, it seems that on the Colab kernel, std::vector allocates memory with new[] yet releases it with free(). It's hard to believe the C++ standard library has a bug like this. Maybe the ctor/dtor of std::vector got linked against different versions of libstdc++? There may be a better explanation.


Sample code that reproduces the same warning from Valgrind:

#include <cstdlib>

int main() {
  auto ptr = new int[4];
  free(ptr); // mismatched new[]/free

  // This one works with no warning from valgrind
  // delete [] ptr;
  return 0;
}

g++ test.cpp -o test && valgrind ./test gives

==27530== Memcheck, a memory error detector
==27530== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==27530== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==27530== Command: ./test
==27530== 
==27530== Mismatched free() / delete / delete []
==27530==    at 0x4C30D3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27530==    by 0x1086EB: main (in /home/yuanming/repos/tmp/test)
==27530==  Address 0x5b7dc80 is 0 bytes inside a block of size 16 alloc'd
==27530==    at 0x4C3089F: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27530==    by 0x1086DB: main (in /home/yuanming/repos/tmp/test)
==27530== 
==27530== 
==27530== HEAP SUMMARY:
==27530==     in use at exit: 0 bytes in 0 blocks
==27530==   total heap usage: 2 allocs, 2 frees, 72,720 bytes allocated
==27530== 
==27530== All heap blocks were freed -- no leaks are possible
==27530== 
==27530== For counts of detected and suppressed errors, rerun with: -v
==27530== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

It's also hard to believe that an issue like this crashes only Taichi, since many other programs use std::vector or something similar. Maybe it's worth asking the Colab people if they have run into any related issues? Thanks.

I'm actually more suspicious about those errors related to libunwind. Some ABI incompatibility may be...

By the way, allocating 1<<44 bytes of vmem didn't work under valgrind :( I had to reduce it to 1<<32.

Hi @znah

I think there are two possibilities:

  1. Somewhere in Taichi there's a silent memory access error that corrupts libunwind.
  2. libunwind itself goes wrong.

I tried throwing some basic exceptions, yet nothing went wrong. Maybe the compiler simply optimized the exception handling out.

To debug, I implemented a slightly more complicated exception handling test: https://github.com/yuanming-hu/taichi/blob/master/taichi/exception_handling_tests.cpp to see if 2 is the cause.

Could you build the latest taichi in the colab kernel, and run

python3 -m taichi test_exception_handling_auto

Thank you so much for your help. I do think colab is a great place for Taichi and I really hope this can happen.

About virtual memory: yes, I'm trying to design Taichi's memory allocator to make it less dependent on huge virtual memory pools...

Hi Yuanming,
I ran your tests and everything passes (although minimal.py still crashes): https://gist.github.com/znah/977894f23aaac8b61eb93c1a048ec372

Happy Holidays, and hope we resolve this issue in 2020! :)

Oh no, this info makes it more mysterious :-( I wish the cause would be as "simple" as an exception handling issue.

Thanks, and happy holidays.

Even when I use the minimal example (https://github.com/taichi-dev/taichi/blob/master/examples/minimal.py) to create a simple notebook, the session crashes.

!pip install taichi-nightly

import taichi as ti


@ti.kernel
def p():
  print(42)


p()

Error message: Your session crashed for an unknown reason. View runtime logs

FINALLY!!!! I identified the problem!
Colab kernels have a libtcmalloc library installed and the environment variable LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 set.
Somehow this causes libstdc++ to use libunwind instead of libgcc_s for stack unwinding on exceptions. For some reason this leads to an abort while unwinding complex calls.

Running
LD_PRELOAD= python t.py
where t.py is some Taichi program, works, even on GPU kernels.
I'm looking for a way to make it work inside Colab cells as well.
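
One possible way to get the same effect from inside a notebook cell is to launch the Taichi program in a child process whose environment has LD_PRELOAD removed. This is only a sketch, not something confirmed in this thread: the helper name run_without_preload is made up for illustration, and it assumes the program lives in a separate script such as the t.py mentioned above.

```python
import os
import subprocess
import sys


def run_without_preload(args):
    """Run a command in a child process with LD_PRELOAD removed from its environment.

    Hypothetical helper for illustration; not part of Taichi or Colab.
    """
    env = dict(os.environ)
    env.pop("LD_PRELOAD", None)  # the child's dynamic loader then preloads nothing
    return subprocess.run(
        args,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )


# Demonstration: the child reports the LD_PRELOAD value it sees (None once removed).
check = run_without_preload(
    [sys.executable, "-c", "import os; print(os.environ.get('LD_PRELOAD'))"]
)
child_preload = check.stdout.strip()
print(child_preload)
```

With a script on disk, the call would look like run_without_preload([sys.executable, "t.py"]). Changing os.environ inside the already running kernel would not help, because the dynamic loader reads LD_PRELOAD only when a process starts.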

WOW!!!!!! FINALLY!!!!!!! This is a really tricky problem to pinpoint - thank you so much for debugging this!!

I guess this will cause other programs that use exceptions to crash on Colab (and I guess the fact that Google does not use C++ exceptions makes this problem more deeply hidden...)

It's even trickier. I suspect some ABI incompatibility between clang and libunwind, that manifests itself only on unwinding complex virtual calls. So quite few programs are probably affected.

Very cool!! What do you think could be a systematic way to solve this? Recompiling Taichi using gcc instead of clang might cause other problems. Would it be possible to override LD_PRELOAD in Colab somewhere through the Colab GUI?

The real way to rectify this issue is to fix a bug somewhere in clang, in (nongnu) libunwind, or in tcmalloc. I don't feel capable of doing that myself. I'll discuss potential solutions with the Colab team.

I don't think I'm able to fix that bug either. Maybe the Colab team could help. Thank you so much for making everything here happen! :-)

By the way, while 0.5.2 works, 0.5.3 crashes for a different reason:

[W 02/25/20 17:47:41.893] [taichi_llvm_context.cpp:module_from_bitcode_file@186] Bitcode loading error message:
Invalid bitcode signature

Oh no.. I'll take a look later today. Thanks for reporting this!

Interesting observation from the Colab team: Taichi works when using tcmalloc_minimal instead of tcmalloc. Relevant bits of documentation:

To use TCMalloc, just link TCMalloc into your application via the "-ltcmalloc" linker flag.

You can use TCMalloc in applications you didn't compile yourself, by using LD_PRELOAD:

   $ LD_PRELOAD="/usr/lib/libtcmalloc.so" 

LD_PRELOAD is tricky, and we don't necessarily recommend this mode of usage.

TCMalloc includes a heap checker and heap profiler as well.

If you'd rather link in a version of TCMalloc that does not include the heap profiler and checker (perhaps to reduce binary size for a static binary), you can link in libtcmalloc_minimal instead.

Also this:

NOTE: When compiling with programs with gcc, that you plan to link
with libtcmalloc, it's safest to pass in the flags

 -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free

when compiling.  gcc makes some optimizations assuming it is using its
own, built-in malloc; that assumption obviously isn't true with
tcmalloc.  In practice, we haven't seen any problems with this, but
the expected risk is highest for users who register their own malloc
hooks with tcmalloc (using gperftools/malloc_hook.h).  The risk is
lowest for folks who use tcmalloc_minimal (or, of course, who pass in
the above flags :-) ).

I'm continuing the investigation.
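
To see which variant a given kernel actually preloads, the LD_PRELOAD value can be inspected directly. The following is a rough sketch: classify_preload is a hypothetical helper, and it tells the variants apart purely by file name, which is an assumption about how the libraries are named on the system.

```python
import os


def classify_preload(value):
    """Label each LD_PRELOAD entry as 'tcmalloc_minimal', 'tcmalloc', or 'other'.

    Hypothetical helper; classification is by file name only.
    """
    labels = {}
    # The dynamic loader accepts both spaces and colons as separators in LD_PRELOAD.
    for entry in value.replace(":", " ").split():
        name = os.path.basename(entry)
        if name.startswith("libtcmalloc_minimal"):
            labels[entry] = "tcmalloc_minimal"
        elif name.startswith("libtcmalloc"):
            labels[entry] = "tcmalloc"
        else:
            labels[entry] = "other"
    return labels


print(classify_preload(os.environ.get("LD_PRELOAD", "")))
```

On a kernel configured as described earlier in this thread, the single entry /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 would be labeled 'tcmalloc'.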

Every version after 0.5.2 fails on Colab with the following error (0.5.2 works fine):

Invalid bitcode signature
Program aborted due to an unhandled Error:
Invalid bitcode signature
[W 03/03/20 17:49:10.868] [llvm_context.cpp:module_from_bitcode_file@187] Bitcode loading error message:
[E 03/03/20 17:49:10.868] [llvm_context.cpp:module_from_bitcode_file@189] Bitcode /usr/local/lib/python3.6/dist-packages/taichi/core/../lib/runtime_x64.bc load failure.
***********************************
* Taichi Compiler Stack Traceback *
***********************************
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::signal_handler(int)
/lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f60d8c16f20]
/lib/x86_64-linux-gnu/libc.so.6: gsignal
/lib/x86_64-linux-gnu/libc.so.6: abort
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: llvm::StringError::StringError(std::error_code, llvm::Twine const&)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::lang::module_from_bitcode_file(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::LLVMContext*)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::lang::TaichiLLVMContext::clone_runtime_module()
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::lang::TaichiLLVMContext::get_init_module()
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::lang::StructCompilerLLVM::StructCompilerLLVM(taichi::lang::Program*, taichi::lang::Arch)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::lang::StructCompiler::make(taichi::lang::Program*, taichi::lang::Arch)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::lang::Program::materialize_layout()
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so: taichi::lang::layout(std::function<void ()> const&)
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so(+0xd41d59) [0x7f60caf0ad59]
/usr/local/lib/python3.6/dist-packages/taichi/core/../lib/taichi_core.so(+0xb7449d) [0x7f60cad3d49d]

Sorry about that. The bitcode loading issue should be fixed in v0.5.6. The buildbots are currently working on compiling/releasing the new version.
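
For anyone hitting this error before upgrading, a quick sanity check is to look at the first four bytes of the runtime bitcode file named in the log above. The sketch below tests only the magic number (the raw 'BC' 0xC0DE signature, or the 0x0B17C0DE wrapper header stored little-endian), so a True result does not prove the file is valid. The helper name looks_like_bitcode is made up, and the thread does not state what actually caused the bad signature.

```python
import os
import tempfile

RAW_MAGIC = b"BC\xc0\xde"            # raw LLVM bitcode: 'B', 'C', 0xC0, 0xDE
WRAPPER_MAGIC = b"\xde\xc0\x17\x0b"  # bitcode wrapper header 0x0B17C0DE, little-endian


def looks_like_bitcode(path):
    """Return True if the file starts with a raw or wrapped LLVM bitcode magic number."""
    with open(path, "rb") as f:
        head = f.read(4)
    return head in (RAW_MAGIC, WRAPPER_MAGIC)


# Self-check on two temporary files, since the real runtime_x64.bc may not be present.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.bc")
    bad = os.path.join(tmp, "bad.bc")
    with open(good, "wb") as f:
        f.write(RAW_MAGIC + b"\x00" * 12)
    with open(bad, "wb") as f:
        f.write(b"not bitcode at all")
    good_ok = looks_like_bitcode(good)
    bad_ok = looks_like_bitcode(bad)

print(good_ok, bad_ok)
```

On Colab the function could be pointed at the runtime_x64.bc path printed in the error message.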

Warning: The issue has been out-of-update for 50 days, marking stale.
