Your question:
When I run a program with Horovod like this:
CUDA_VISIBLE_DEVICES=0,2 horovodrun -np 2 -H localhost:2 python xxxx.py
I get an error like this:
[1,1]<stderr>:terminate called after throwing an instance of 'c10::Error'
[1,1]<stderr>: what(): CUDA error: device not ready (Ready at horovod/torch/ready_event.cc:92)
[1,1]<stderr>:frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7fd94bc2eb5e in /data/shaozl/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libc10.so)
[1,1]<stderr>:frame #1: horovod::torch::TorchReadyEvent::Ready() const + 0x11b (0x7fd939fa283b in /data/shaozl/anaconda3/envs/pytorch/lib/python3.7/site-packages/horovod/torch/mpi_lib_v2.cpython-37m-x86_64-linux-gnu.so)
[1,1]<stderr>:frame #2: <unknown function> + 0x75f64 (0x7fd939f0ff64 in /data/shaozl/anaconda3/envs/pytorch/lib/python3.7/site-packages/horovod/torch/mpi_lib_v2.cpython-37m-x86_64-linux-gnu.so)
[1,1]<stderr>:frame #3: <unknown function> + 0xc819d (0x7fd97f29c19d in /data/shaozl/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
[1,1]<stderr>:frame #4: <unknown function> + 0x76db (0x7fd9a0e5a6db in /lib/x86_64-linux-gnu/libpthread.so.0)
[1,1]<stderr>:frame #5: clone + 0x3f (0x7fd9a0b8388f in /lib/x86_64-linux-gnu/libc.so.6)
This problem occurs every time with 2 GPUs, and about half the time with 1 GPU.
I also tried the official MNIST example for PyTorch, and the problem appears there too. The difference is that with the MNIST example the error sometimes does not occur with 2 GPUs.
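For anyone reading the command above: with CUDA_VISIBLE_DEVICES=0,2 each process sees only two devices, renumbered as cuda:0 and cuda:1. A minimal sketch of that mapping follows. The helper name `visible_device_map` is made up for illustration and is not part of Horovod or PyTorch, and the sketch assumes the usual pattern of one process per GPU selected by local rank.

```python
def visible_device_map(env_value):
    """Map in-process CUDA indices to the ids listed in a CUDA_VISIBLE_DEVICES string.

    Hypothetical helper for illustration only.
    """
    ids = [part.strip() for part in env_value.split(",") if part.strip()]
    return {index: physical for index, physical in enumerate(ids)}


# Value taken from the command in the question.
mapping = visible_device_map("0,2")

# Assumption: the script pins each process to the device whose index equals
# its local rank (commonly done with torch.cuda.set_device(hvd.local_rank())).
for local_rank in range(2):
    print(f"local rank {local_rank} -> cuda:{local_rank} -> physical GPU {mapping[local_rank]}")
```

So a script launched this way should only ever refer to device indices 0 and 1, never to index 2.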
I reinstalled the whole environment and the problem was solved. ORZ
Hi @zanonShao, I also ran into this issue. What do you mean by "reinstalled the whole environment"?
I installed the whole environment with conda:
1. CUDA: conda install -c anaconda cudatoolkit
2. PyTorch: conda install pytorch torchvision -c pytorch (if downloading from the 'pytorch' channel is slow, you can try the mirror at https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/)
3. MPI: conda install -c conda-forge mpi and conda install -c conda-forge mpi4py
4. NCCL: conda install -c conda-forge nccl
5. gxx_64: conda install -c conda-forge gxx_impl_linux-64
Finally, install Horovod with pip:
6. pip install --no-cache-dir horovod
That worked for me. Good luck.
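After reinstalling, a quick sanity check is to confirm that the packages from the steps above can be found by the Python interpreter in the active conda environment. This is only a rough sketch: `check_modules` is a hypothetical helper, and it only tests whether the modules are discoverable, not whether CUDA, NCCL, or MPI actually work at runtime.

```python
import importlib.util


def check_modules(names):
    """Return {module_name: True/False} for whether each top-level module can be found.

    Hypothetical helper; it does not import the modules or touch the GPU.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Top-level module names for the packages installed in the steps above.
for name, found in check_modules(["torch", "torchvision", "mpi4py", "horovod"]).items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If anything is reported as missing, the wrong environment is probably active, or that install step did not complete.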