Taichi: Failed to run fractal.py on GPU: cudaErrorInvalidDevice: invalid device ordinal

Created on 11 Feb 2020 · 12Comments · Source: taichi-dev/taichi

With the latest version of taichi-nightly-cuda-10-0 installed, I run the sample code and get an error:

(base) [bate@archit gpu]$ python fractal.py   
[Release mode]
[T 02/11/20 18:41:37.756] [logging.cpp:Logger@68] Taichi core started. Thread ID = 1273
[Taichi version 0.4.5, cuda 10.0, commit 8b575a8e]
[I 02/11/20 18:41:37.977] [llvm_jit_ptx.cpp:CUDAContext@154] Using CUDA Device [id=0]: GeForce 940MX
[I 02/11/20 18:41:37.977] [llvm_jit_ptx.cpp:CUDAContext@162] CUDA Device Compute Capability: 5.0
[I 02/11/20 18:41:38.046] [memory_pool.cpp:MemoryPool@15] Memory pool created. Default buffer size per allocator = 1024 MB
[I 02/11/20 18:41:38.047] [taichi_llvm_context.cpp:TaichiLLVMContext@57] Creating llvm context for arch: x86_64
[D 02/11/20 18:41:38.047] [snode.cpp:create_node@48] Non-power-of-two node size 640 promoted to 1024.
[D 02/11/20 18:41:38.047] [snode.cpp:create_node@48] Non-power-of-two node size 320 promoted to 512.
[I 02/11/20 18:41:38.067] [/home/bate/anaconda3/lib/python3.7/site-packages/taichi/lang/impl.py:materialize@127] Materializing layout...
[I 02/11/20 18:41:38.403] [struct_llvm.cpp:operator()@287] Allocating data structure of size 2097152 B
[I 02/11/20 18:41:38.408] [unified_allocator.cpp:UnifiedAllocator@17] Allocating unified (CPU+GPU) address space of size 1024 MB
[E 02/11/20 18:41:38.488] [unified_allocator.cpp:UnifiedAllocator@27] Cuda Error cudaErrorInvalidDevice: invalid device ordinal
[E 02/11/20 18:41:38.488] Received signal 6 (Aborted)
***********************************
* Taichi Compiler Stack Traceback *
***********************************
/home/bate/anaconda3/lib/python3.7/site-packages/taichi/core/../lib/taichi_core.so: taichi::signal_handler(int)
/usr/lib/libc.so.6(+0x3bfb0) [0x7fce10d35fb0]
/usr/lib/libpthread.so.0: raise
/home/bate/anaconda3/lib/python3.7/site-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::UnifiedAllocator::UnifiedAllocator(unsigned long, taichi::Tlang::Arch)
/home/bate/anaconda3/lib/python3.7/site-packages/taichi/core/../lib/taichi_core.so: taichi::Tlang::MemoryPool::allocate(unsigned long, unsigned long)
[0x7fce0043b02e]
(base) [bate@archit gpu]$ nvidia-smi
Tue Feb 11 18:54:20 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce 940MX       Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   38C    P0    N/A /  N/A |      0MiB /  2004MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
(base) [bate@archit gpu]$ uname -a
Linux archit 5.4.13-arch1-1 #1 SMP PREEMPT Fri, 17 Jan 2020 23:09:54 +0000 x86_64 GNU/Linux
(base) [bate@archit gpu]$

stale

Source

archibate

All 12 comments

I take a look at taichi/unified_allocator.cpp:27:

cudaMemAdvise(_cuda_data, size, cudaMemAdviseSetPreferredLocation, 0);

And write a test code test.cu:

#define CE(err)  \
    do { auto __err = (err); if (__err) {    \
        fprintf(stderr, "CUDA Error in %s:%d: %s\n",    \
        __FILE__, __LINE__, cudaGetErrorString(__err)); \
        exit(1);                                        \
    }} while (0)

#include <cstdio>

template<typename T>
__global__ void kernel(size_t *value)
{
    *value = sizeof(T);
}

int main()
{
    size_t size = 1024 * 1024 * 1024;
    size_t *data = NULL;
    CE(cudaMallocManaged(&data, size));
    CE(cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, 0));
    kernel<double><<<1, 1>>>(data);
    CE(cudaGetLastError());
    CE(cudaDeviceSynchronize());
    printf("data[0] = %d\n", data[0]);
    CE(cudaFree(data));
    return 0;
}

It's result (as we expected):

CUDA Error in test.cu:23: invalid device ordinal

If I change

cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, 0);

cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

It's result:

data[0] = 8

(cudaCpuDeviceId evaluates -1 on my machine with 4 CPU core)

Is this a right fix for the issue?

archibate on 11 Feb 2020

👍1

The reason is that pre-Pascal cards (e.g. GTX 940M) do not support cudaMemAdvise. In the latest commit I have weakened this check to throw a warning only. Could you have a try?

(cudaMemAdvise here is just to make sure data stays on GPU memory as much as possible under unified memory settings since Pascal.)

yuanming-hu on 12 Feb 2020

👍1

Thanks for your suggestion! It might works but, I failed to build taichi for GPU due to compilation reasons. When compiling runtime.cpp, it says void vprintf(...) doesn't match it's declaration int vprintf(...) in . After I commented vprintf from PrintfXXX, more other errors come out...
Maybe I could wait for next release, thanks :)

archibate on 13 Feb 2020

Yeah, the printf vprintf mess is leading to a lot of portability issues. A systematic solution could be https://github.com/taichi-dev/taichi/issues/470

yuanming-hu on 16 Feb 2020

👍1

The reason is that pre-Pascal cards (e.g. GTX 940M) do not support cudaMemAdvise. In the latest commit I have weakened this check to throw a warning only. Could you have a try?

Yeah, it's a warning now, but I got a segfault after that:

[Taichi version 0.5.0, cuda 10.0, commit 55ec6dfb]
[Hint] Use WSAD/arrow keys to control gravity. Use left/right mouse bottons to attract/repel. (OS X not yet supported)
[W 02/16/20 12:32:33.422] [unified_allocator.cpp:UnifiedAllocator@38] Cuda Error cudaErrorInvalidDevice (treated as warning): invalid device ordinal
[E 02/16/20 12:32:33.434] Received signal 11 (Segmentation fault)
***********************************
* Taichi Compiler Stack Traceback *                                                                         
***********************************                                                                         
/home/bate/Develop/taichi/build/taichi_core.so: taichi::signal_handler(int)                                 
/usr/lib/libc.so.6(+0x3bfb0) [0x7fa9876b9fb0]                                                               
[0x7fa9771bc094]

GNU gdb (GDB) 8.3.1
.......................................................
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007fa98774a796 in waitpid () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007fa98774a796 in waitpid () from /usr/lib/libc.so.6
#1  0x00007fa9876c6f37 in do_system () from /usr/lib/libc.so.6
#2  0x00007fa977abc1d8 in taichi::signal_handler (signo=11) at ../taichi/core/logging.cpp:145
#3  <signal handler called>
#4  0x00007fa9771bc094 in ?? ()
#5  0x0000000000000000 in ?? ()
(gdb)

Stack seems broken...

archibate on 16 Feb 2020

https://github.com/taichi-dev/taichi/blob/b66f216ef9903ae67aafadece00203edbb143306/taichi/unified_allocator.cpp#L38
I simply remove this line, it works.
My suggestion: if (!user_is_pascal_card()) cudaMemAdvice(...);

archibate on 16 Feb 2020

Oh I see. It seems that cudaMemAdvice crashes the program instead of returns with an error code?

yuanming-hu on 16 Feb 2020

Oh I see. It seems that cudaMemAdvice crashes the program instead of returns with an error code?

It did returns error, but after reporting the error, it crashes somehow.

BTW:
https://github.com/taichi-dev/taichi/blob/16c1e2297e4fdf29c281656044533ef59a061d0e/taichi/cuda_utils.h#L17
use do { ... } while (0) instead of { ... } is more safe for multi-line macros. eg.

if (x)
  check_cuda_errors(233);
else
  do_other();

becomes:

if (x)
{ ... };
else
  do_other();

extra comma causes else wrong. especially harmful when you have multiple ifs above, you don't get an error.

archibate on 16 Feb 2020

👀1 👍1

After fixed cudaMemAdvice by removing it, I got an Bus Error:

[Taichi version 0.5.0, cuda 10.0, commit 55ec6dfb]
[Hint] Use WSAD/arrow keys to control gravity. Use left/right mouse bottons to attract/repel. (OS X not yet supported)
[E 02/16/20 13:52:32.652] Received signal 7 (Bus error)
***********************************
* Taichi Compiler Stack Traceback *                                                                         
***********************************                                                                         
/home/bate/Develop/taichi/build/taichi_core.so: taichi::signal_handler(int)                                 
/usr/lib/libc.so.6(+0x3bfb0) [0x7ff83dd40fb0]                                                               
/home/bate/Develop/taichi/build/taichi_core.so: taichi::Tlang::MemoryPool::daemon()                         
/usr/lib/libstdc++.so.6(+0xcfee4) [0x7ff82c4f7ee4]                                                          
/usr/lib/libpthread.so.0(+0x94cf) [0x7ff83d9744cf]                                                          
/usr/lib/libc.so.6: clone                                                                                   


GNU gdb (GDB) 8.3.1
Attaching to process 16400
[New LWP 16405]
[New LWP 16406]
[New LWP 16407]
[New LWP 16408]
[New LWP 16412]
[New LWP 16413]
[New LWP 16414]
[New LWP 16415]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007ff82e807272 in llvm::PrettyStackTraceEntry::~PrettyStackTraceEntry() ()
   from /home/bate/Develop/taichi/build/taichi_core.so
(gdb) bt
#0  0x00007ff82e807272 in llvm::PrettyStackTraceEntry::~PrettyStackTraceEntry() ()
   from /home/bate/Develop/taichi/build/taichi_core.so
#1  0x00007ff82e4338f7 in llvm::FPPassManager::runOnFunction(llvm::Function&) ()
   from /home/bate/Develop/taichi/build/taichi_core.so
#2  0x00007ff82ea23baf in (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) ()
   from /home/bate/Develop/taichi/build/taichi_core.so
#3  0x00007ff82e432f1d in llvm::legacy::PassManagerImpl::run(llvm::Module&) ()
   from /home/bate/Develop/taichi/build/taichi_core.so
#4  0x00007ff82e110800 in taichi::Tlang::compile_module_to_ptx[abi:cxx11](std::unique_ptr<llvm::Module, std::default_delete<llvm::Module> >&) (module=std::unique_ptr<class llvm::Module> = {...})
    at ../taichi/backends/llvm_jit_ptx.cpp:138
#5  0x00007ff82e0c55ce in taichi::Tlang::CodeGenLLVMGPU::compile_module_to_executable (this=0x7fff8e2f27d8)
    at ../taichi/backends/codegen_llvm_ptx.cpp:83
#6  0x00007ff82e0baaa6 in taichi::Tlang::CodeGenLLVM::gen (this=<optimized out>)
    at ../taichi/backends/codegen_llvm.h:252
#7  taichi::Tlang::GPUCodeGen::codegen_llvm (this=<optimized out>)
    at ../taichi/backends/codegen_llvm_ptx.cpp:374
#8  0x00007ff82e10f2fb in taichi::Tlang::KernelCodeGen::compile (this=0x7fff8e2f2a60, prog=..., kernel=...)
    at ../taichi/backends/kernel.cpp:16
#9  0x00007ff82e1d6a7e in taichi::Tlang::Program::compile (this=0x55aaa227f440, kernel=...)
    at ../taichi/program.cpp:85
#10 0x00007ff82e1cd78b in taichi::Tlang::Kernel::compile (this=0x55aaa373e7a0) at ../taichi/kernel.cpp:39
#11 0x00007ff82e1cd833 in taichi::Tlang::Kernel::operator() (this=0x55aaa373e7a0)
    at ../taichi/kernel.cpp:45
#12 0x00007ff82e248224 in taichi::export_lang(pybind11::module&)::$_8::operator()(taichi::Tlang::Kernel*) const (kernel=0x55aaa373e7a0, this=<optimized out>) at ../taichi/python_bindings.cpp:166
#13 pybind11::detail::argument_loader<taichi::Tlang::Kernel*>::call_impl<void, taichi::export_lang(pybind11::module&)::$_8&, 0ul, pybind11::detail::void_type>(taichi::export_lang(pybind11::module&)::$_8&, std::integer_sequence<unsigned long, 0ul>, pybind11::detail::void_type&&) (this=<optimized out>, f=...)
    at /home/bate/.local/include/python3.8/pybind11/cast.h:1935
#14 pybind11::detail::argument_loader<taichi::Tlang::Kernel*>::call<void, pybind11::detail::void_type, taich--Type <RET> for more, q to quit, c to continue without paging--q
Quit
(gdb)

archibate on 16 Feb 2020

use do { ... } while (0) instead of { ... } is more safe for multi-line macros.

Oh I see. That's a good point! Could you fix this?

The bus error is likely related to the memory pool daemon, which might depend on some Pascal features... We need to do preallocation on pre-Pascal cards for compatibility...

yuanming-hu on 16 Feb 2020

🚀1

Warning: The issue has been out-of-update for 50 days, marking stale.