Horovod: CUDA error: device not ready

Created on 2 Jul 2020 · 3 Comments · Source: horovod/horovod

Environment:

Framework: PyTorch
Framework version: 1.5.0
Horovod version: 0.19.2
MPI version: 1.0
CUDA version: 10.2.89
NCCL version: 2.6.4.1
Python version: 3.7
OS and version: Ubuntu 16.04.6 LTS
GCC version: gxx_linux-64 7.3.0

Checklist:

Did you search issues to find if somebody asked this question before?
If your question is about hang, did you read this doc?
If your question is about docker, did you read this doc?
Did you check if your question is answered in the troubleshooting guide?

Your question:
When I use Horovod to run a program like this:
CUDA_VISIBLE_DEVICES=0,2 horovodrun -np 2 -H localhost:2 python xxxx.py
I get the following error:

[1,1]<stderr>:terminate called after throwing an instance of 'c10::Error'
[1,1]<stderr>:  what():  CUDA error: device not ready (Ready at horovod/torch/ready_event.cc:92)
[1,1]<stderr>:frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7fd94bc2eb5e in /data/shaozl/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libc10.so)
[1,1]<stderr>:frame #1: horovod::torch::TorchReadyEvent::Ready() const + 0x11b (0x7fd939fa283b in /data/shaozl/anaconda3/envs/pytorch/lib/python3.7/site-packages/horovod/torch/mpi_lib_v2.cpython-37m-x86_64-linux-gnu.so)
[1,1]<stderr>:frame #2: <unknown function> + 0x75f64 (0x7fd939f0ff64 in /data/shaozl/anaconda3/envs/pytorch/lib/python3.7/site-packages/horovod/torch/mpi_lib_v2.cpython-37m-x86_64-linux-gnu.so)
[1,1]<stderr>:frame #3: <unknown function> + 0xc819d (0x7fd97f29c19d in /data/shaozl/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
[1,1]<stderr>:frame #4: <unknown function> + 0x76db (0x7fd9a0e5a6db in /lib/x86_64-linux-gnu/libpthread.so.0)
[1,1]<stderr>:frame #5: clone + 0x3f (0x7fd9a0b8388f in /lib/x86_64-linux-gnu/libc.so.6)

This problem occurs 100% of the time with 2 GPUs, and about 50% of the time with 1 GPU.

I also tried the official PyTorch MNIST example, and the problem appears there too. The difference is that with MNIST the problem sometimes does not occur with 2 GPUs.
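For anyone reproducing this, it may help to confirm which physical GPU each process ends up on. With CUDA_VISIBLE_DEVICES=0,2, CUDA renumbers the visible devices from 0, so device index 1 inside the process is physical GPU 2. The sketch below only illustrates that mapping. It assumes the training script pins each process to the device whose index equals its local rank (the usual pattern in Horovod's PyTorch examples, but not confirmed for xxxx.py), and the function name is hypothetical. It does not diagnose the "device not ready" error itself.

```python
import os


def physical_gpu_for_local_rank(local_rank, visible=None):
    """Return the physical GPU id (as a string) behind a process-local device index.

    Assumption: the process selects CUDA device index == local_rank.
    `visible` defaults to the CUDA_VISIBLE_DEVICES environment variable.
    """
    if visible is None:
        visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    # Entries may be integer ids or GPU UUIDs, so keep them as strings.
    ids = [token.strip() for token in visible.split(",") if token.strip()]
    if not ids:
        raise ValueError("CUDA_VISIBLE_DEVICES is empty or unset; no remapping to report")
    if not 0 <= local_rank < len(ids):
        raise IndexError(
            f"local rank {local_rank} has no visible device (only {len(ids)} visible)"
        )
    return ids[local_rank]


if __name__ == "__main__":
    # Mirrors the command in the question: two processes, GPUs 0 and 2 visible.
    for rank in range(2):
        print(f"local rank {rank} -> physical GPU {physical_gpu_for_local_rank(rank, '0,2')}")
```

If a local rank has no visible device (for example, -np 2 with only one GPU listed), the function raises IndexError, which is a quick way to spot a mismatch between the process count and the visible devices.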

zanonShao

Most helpful comment

I reinstalled the whole environment and the problem was solved. ORZ

Hi @zanonShao, I also ran into this issue. What do you mean by "reinstalled the whole environment"?

I installed the whole environment with conda:
1. CUDA: conda install -c anaconda cudatoolkit
2. PyTorch: conda install pytorch torchvision -c pytorch (if downloading from the 'pytorch' channel is slow, you can try https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/)
3. MPI: conda install -c conda-forge mpi, then conda install -c conda-forge mpi4py
4. NCCL: conda install -c conda-forge nccl
5. g++: conda install -c conda-forge gxx_impl_linux-64
Finally, use pip to install Horovod:
6. pip install --no-cache-dir horovod

That worked for me. Good luck.
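After a reinstall like this, a quick way to see what actually landed in the environment is to list the installed Python package versions. This is a minimal sketch, not part of the original fix: the function name is hypothetical, it needs Python 3.8+ for importlib.metadata, and it only sees Python distributions, so conda-only packages such as cudatoolkit or nccl will not show up here.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names taken from the install steps above.
    report = installed_versions(["torch", "torchvision", "mpi4py", "horovod"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

Reading the metadata avoids importing torch or horovod, so the check still runs when one of them fails to load. Running `horovodrun --check-build` should additionally show which frameworks and collective backends (MPI, NCCL) Horovod was compiled against.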

zanonShao on 4 Jul 2020

👍2

All 3 comments

I reinstalled the whole environment and the problem was solved. ORZ

zanonShao on 2 Jul 2020

I reinstalled the whole environment and the problem was solved. ORZ

Hi @zanonShao, I also ran into this issue. What do you mean by "reinstalled the whole environment"?

Approximetal on 4 Jul 2020

I reinstalled the whole environment and the problem was solved. ORZ

Hi @zanonShao, I also ran into this issue. What do you mean by "reinstalled the whole environment"?