Taichi: [CUDA] detected to be supported and crash on card without unified memory

Created on 13 Apr 2020 · 15 Comments · Source: taichi-dev/taichi

Describe the bug
CUDA is detected as SUPPORTED on a machine without CUDA.
This is because is_cuda_api_avaliable returns true even though I don't have CUDA.

Log/Screenshots

(yuanming-hu/glfw) [bate@archit taichi]$ python examples/mpm128.py  
[Taichi] mode=development
[Taichi] preparing sandbox at /tmp/taichi-7haz507t
[Taichi] sandbox prepared
[Taichi] <dev mode>, supported archs: [cpu, cuda, opengl], commit 4e2e5605, python 3.8.2
[Hint] Use WSAD/arrow keys to control gravity. Use left/right mouse bottons to attract/repel. Press R to reset.
[W 04/13/20 09:29:21.266] [cuda_driver.h:call_with_warning@60] CUDA Error CUDA_ERROR_INVALID_DEVICE: invalid device ordinal while calling mem_advise (cuMemAdvise)
[E 04/13/20 09:29:21.860] Received signal 7 (Bus error)


***********************************
* Taichi Compiler Stack Traceback *                                                          
***********************************                                                          
/tmp/taichi-7haz507t/taichi_core.so: taichi::Logger::error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)                                  
/tmp/taichi-7haz507t/taichi_core.so: taichi::signal_handler(int)                             
/usr/lib/libc.so.6(+0x3bd70) [0x7f359062bd70]                                                
/tmp/taichi-7haz507t/taichi_core.so: taichi::lang::MemoryPool::daemon()
/usr/lib/libstdc++.so.6(+0xcfb24) [0x7f357ff41b24]
/usr/lib/libpthread.so.0(+0x946f) [0x7f359021746f]
/usr/lib/libc.so.6: clone
GNU gdb (GDB) 9.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 8383
[New LWP 8388]
[New LWP 8389]
[New LWP 8390]
[New LWP 8391]
[New LWP 8396]
[New LWP 8397]
[New LWP 8398]
[New LWP 8399]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f3580fc910c in llvm::Twine::toVector(llvm::SmallVectorImpl<char>&) const ()
   from /tmp/taichi-7haz507t/taichi_core.so
(gdb) 

To Reproduce
Just run examples/mpm128.py.

If you have local commits (e.g. compile fixes before you reproduce the bug), please make sure you first make a PR to fix the build errors and then report the bug.

dependency potential bug

All 15 comments

Introduced in #756

Does setting envvar TI_USE_UNIFIED_MEMORY=0 fix your problem?

Yes, it did.

(gdbtrig) [bate@archit taichi]$ TI_USE_UNIFIED_MEMORY=0 p examples/fractal.py 
[Taichi] mode=development
[Taichi] preparing sandbox at /tmp/taichi-mxhjexut
[Taichi] sandbox prepared
[I 04/13/20 09:42:03.300] [cuda_driver.cpp:CUDADriver@30] CUDA DETECTED
[Taichi] <dev mode>, supported archs: [cpu, cuda, opengl], commit 4e2e5605, python 3.8.2
X connection to :0 broken (explicit kill or server shutdown).

The with_cuda still returns true however, according to my TI_INFO("CUDA_DETECTED");.

I never tried the CUDA backend without unified memory on Linux - glad that it works now. Maybe we should simply disable CUDA backend unified memory on NVIDIA GTX 9 series on Linux? It's better than not having a CUDA backend at all, for these users.

How important is unified memory? If Taichi fails to run without it, maybe we should make with_cuda return false when unified memory is not supported?

Or should we detect whether a card supports UM and set TI_USE_UNIFIED_MEMORY to 0 or 1 automatically? Or should we default to 0?
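For illustration, here is a minimal sketch of what such automatic detection could look like from the Python side. It queries the CUDA driver through ctypes and sets TI_USE_UNIFIED_MEMORY only when the user has not set it already. Several things here are assumptions rather than anything confirmed in this thread: that CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS is the right capability to test (the log only shows cuMemAdvise failing), that the driver library is named libcuda.so.1 (Linux), and that Taichi reads the variable after this code runs. The helper names are hypothetical.

```python
import ctypes
import os

# Value of CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS in the CUDA driver API.
CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS = 89


def device_supports_concurrent_managed_access(ordinal=0):
    """Return True only if the driver reports the attribute as nonzero.

    Any failure (no driver, no device, API error) is treated as "not supported".
    """
    try:
        cuda = ctypes.CDLL("libcuda.so.1")  # assumption: Linux driver library name
    except OSError:
        return False  # driver library not found
    if cuda.cuInit(0) != 0:
        return False
    device = ctypes.c_int()
    if cuda.cuDeviceGet(ctypes.byref(device), ordinal) != 0:
        return False
    value = ctypes.c_int()
    status = cuda.cuDeviceGetAttribute(
        ctypes.byref(value),
        CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS,
        device,
    )
    return status == 0 and value.value != 0


def unified_memory_env_value(supported):
    """Map the capability flag to the string used for the environment variable."""
    return "1" if supported else "0"


def configure_unified_memory(environ=os.environ):
    """Set TI_USE_UNIFIED_MEMORY unless the user already chose a value."""
    if "TI_USE_UNIFIED_MEMORY" not in environ:
        supported = device_supports_concurrent_managed_access()
        environ["TI_USE_UNIFIED_MEMORY"] = unified_memory_env_value(supported)
    return environ["TI_USE_UNIFIED_MEMORY"]
```

Calling configure_unified_memory() before importing taichi would keep an explicit TI_USE_UNIFIED_MEMORY=0 from the shell intact and only fill in a value when none is given. Doing the same check inside the C++ driver wrapper would probably be the cleaner long-term fix.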

With TI_USE_UNIFIED_MEMORY=0, I got an OOM when running the numpy test:

______________________________________ test_numpy_2d _______________________________________
[gw2] linux -- Python 3.8.2 /usr/bin/python3

    @ti.all_archs
    def test_numpy_2d():
        val = ti.var(ti.i32)

        n = 4
        m = 7

        @ti.layout
        def values():
            ti.root.dense(ti.i, n).dense(ti.j, m).place(val)

        @ti.kernel
        def test_numpy(arr: ti.ext_arr()):
            for i in range(n):
                for j in range(m):
                    arr[i, j] += i + j

        a = np.empty(shape=(n, m), dtype=np.int32)

        for i in range(n):
            for j in range(m):
                a[i, j] = i * j

>       test_numpy(a)

tests/python/test_numpy.py:75: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
python/taichi/lang/kernel.py:484: in wrapped
    primal(*args, **kwargs)
python/taichi/lang/kernel.py:414: in __call__
    self.materialize(key=key, args=args, arg_features=arg_features)
python/taichi/lang/kernel.py:220: in materialize
    self.runtime.materialize()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <taichi.lang.impl.PyTaichi object at 0x7ff1406bf970>

    def materialize(self):
        if self.materialized:
            return
        self.create_program()
        Expr.layout_materialized = True

        def layout():
            for func in self.layout_functions:
                func()

        import taichi as ti
        ti.trace("Materializing layout...".format())
>       taichi_lang_core.layout(layout)
E       RuntimeError: [cuda_driver.h:operator()@66] CUDA Error CUDA_ERROR_OUT_OF_MEMORY: out of memory while calling malloc (cuMemAlloc_v2)

python/taichi/lang/impl.py:179: RuntimeError
----------------------------------- Captured stdout call ------------------------------------
Running test on arch=Arch.x64
Running test on arch=Arch.cuda
[E 04/13/20 15:14:17.564] [cuda_driver.h:operator()@66] CUDA Error CUDA_ERROR_OUT_OF_MEMORY: out of memory while calling malloc (cuMemAlloc_v2)

I think I found the cause of the OOM:

MULTI-THREADING!!!!

There is no error with -t1.

Do you have 2 NVIDIA cards? That would explain why your 2 cores run fine. I have 1 NVIDIA card but 4 cores, so there are not enough resources for every test process to claim, hence the OOM.

It seems the first process to run obtains the resource, and the second one hits OOM:

(glew) [bate@archit taichi]$ p multirestart.my.py & p multirestart.my.py  
[1] 5805
[Taichi] mode=development
[Taichi] mode=development
[Taichi] preparing sandbox at /tmp/taichi-vs1ag8qy
[Taichi] preparing sandbox at /tmp/taichi-iukg5cuq
[Taichi] sandbox prepared
[Taichi] sandbox prepared
[Taichi] <dev mode>, supported archs: [cpu, cuda, opengl], commit 2ffc1e55, python 3.8.2
[Taichi] <dev mode>, supported archs: [cpu, cuda, opengl], commit 2ffc1e55, python 3.8.2
[E 04/13/20 22:57:47.637] [cuda_driver.h:operator()@66] CUDA Error CUDA_ERROR_OUT_OF_MEMORY: out of memory while calling malloc (cuMemAlloc_v2)


***********************************
* Taichi Compiler Stack Traceback *                                                          
***********************************                                                          
/tmp/taichi-iukg5cuq/taichi_core.so: taichi::Logger::error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)                                  
/tmp/taichi-iukg5cuq/taichi_core.so: taichi::lang::CUDADriverFunction<void*, unsigned long>::operator()(void*, unsigned long)                                                             
/tmp/taichi-iukg5cuq/taichi_core.so: taichi::lang::Program::initialize_runtime_system(taichi::lang::StructCompiler*)                                                                      
/tmp/taichi-iukg5cuq/taichi_core.so: taichi::lang::Program::materialize_layout()             
/tmp/taichi-iukg5cuq/taichi_core.so: taichi::lang::layout(std::function<void ()> const&)     
/tmp/taichi-iukg5cuq/taichi_core.so(+0x7dea29) [0x7f5e046a3a29]                              
/tmp/taichi-iukg5cuq/taichi_core.so(+0x68f566) [0x7f5e04554566]                              
/usr/lib/libpython3.8.so.1.0: PyCFunction_Call                                               
/usr/lib/libpython3.8.so.1.0: _PyObject_MakeTpCall                                           
/usr/lib/libpython3.8.so.1.0: _PyEval_EvalFrameDefault                                       
/usr/lib/libpython3.8.so.1.0: _PyEval_EvalCodeWithName                                       
/usr/lib/libpython3.8.so.1.0(+0x1e0902) [0x7f5e1413a902]                                     
/usr/lib/libpython3.8.so.1.0: _PyEval_EvalFrameDefault                                       
/usr/lib/libpython3.8.so.1.0: _PyFunction_Vectorcall                                         
/usr/lib/libpython3.8.so.1.0(+0x17e0ad) [0x7f5e140d80ad]                                     
/usr/lib/libpython3.8.so.1.0(+0x17eec8) [0x7f5e140d8ec8]                                     
/usr/lib/libpython3.8.so.1.0: _PyEval_EvalFrameDefault                                       
/usr/lib/libpython3.8.so.1.0: _PyEval_EvalCodeWithName                                       
/usr/lib/libpython3.8.so.1.0: PyEval_EvalCode                                                
/usr/lib/libpython3.8.so.1.0(+0x2668c8) [0x7f5e141c08c8]                                     
/usr/lib/libpython3.8.so.1.0(+0x26aba3) [0x7f5e141c4ba3]                                     
/usr/lib/libpython3.8.so.1.0: PyRun_FileExFlags                                              
/usr/lib/libpython3.8.so.1.0: PyRun_SimpleFileExFlags                                        
/usr/lib/libpython3.8.so.1.0: Py_RunMain
/usr/lib/libpython3.8.so.1.0: Py_BytesMain
/usr/lib/libc.so.6: __libc_start_main
python(_start+0x2e) [0x563f74cfc05e]
GNU gdb (GDB) 9.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 5806
[New LWP 5819]
[New LWP 5820]
[New LWP 5821]
[New LWP 5822]
[New LWP 5829]
[New LWP 5830]
[New LWP 5832]
[New LWP 5833]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
3
0x00007f5e143e69af in wait4 () from /usr/lib/libc.so.6
(gdb) 

I found this in the log: Allocating device memory 1.00 GB

Why allocate so much???

I only have 2 GB of GPU memory in total... so of course it crashes in the multi-threaded tests.

Solution 1: set device_memory_fraction = 1 / (threads + 1) in the tests.
Solution 2: spin-wait in the tests until enough memory is available.
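As a rough sketch of Solution 1, the fraction and the resulting per-process budget can be computed as below. The helper names are hypothetical, and passing the result to ti.init(device_memory_fraction=...) is shown only in a comment, on the assumption that the test runner knows its thread count and that ti.init accepts this option in the version being tested.

```python
def per_process_memory_fraction(num_threads):
    """Fraction of GPU memory each test process may preallocate.

    Follows the formula proposed above: 1 / (threads + 1). The extra +1
    leaves one share free for other users of the GPU, e.g. the display server.
    """
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    return 1.0 / (num_threads + 1)


def per_process_budget_gb(total_gb, num_threads):
    """Memory budget per process, in GB, for a card with total_gb of memory."""
    return total_gb * per_process_memory_fraction(num_threads)


# Example from this thread: a 2 GB card and 4 test threads.
fraction = per_process_memory_fraction(4)   # 0.2
budget = per_process_budget_gb(2.0, 4)      # 0.4 GB per process

# Hypothetical use in the test setup (not executed here):
#   ti.init(arch=ti.cuda, device_memory_fraction=fraction)
```

With 4 threads on a 2 GB card, each process would then preallocate about 0.4 GB instead of 1.00 GB, so the four allocations together stay below the total.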

I guess solution 1 is probably easier. Or we can just ask people not to use too many threads when GPU memory is scarce. (Sorry about my delayed reply - workday starts on my end so I have meetings in the morning...)

...Sorry about my delayed reply...

No rush at all!
