Hi all,
I am trying to use Horovod for multi-node, multi-GPU training, but I ran into a problem.
Environment:
Ubuntu 16.04 with Docker, PyTorch 0.4.0, Horovod
mpirun with -np 32; 4 nodes, each with 8 GPUs
When broadcasting, it fails with an error message like: broadcast meets error: broadcast.*conv2.weight [missing ranks 1, 2, 3]
At the same time, a small amount of GPU memory is occupied on nodes 1, 2, and 3, while node 0 is able to start training.
Why is this happening? I followed the example on the official website, and nothing looks wrong.
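For reference, this is roughly the launch command and per-rank setup I use, following the official Horovod PyTorch example (a simplified sketch; the model, script name, and hostnames here are placeholders, not my actual code):

```python
# Simplified sketch of what runs on every rank, modeled on the official
# Horovod PyTorch example. Net, train.py, and hostnames are placeholders.
#
# Launch (4 nodes x 8 GPUs):
#   mpirun -np 32 -H node0:8,node1:8,node2:8,node3:8 python train.py

import torch
import torch.nn.functional as F
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # pin each process to one GPU


class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = torch.nn.Conv2d(3, 16, 3)
        self.conv2 = torch.nn.Conv2d(16, 32, 3)

    def forward(self, x):
        return self.conv2(F.relu(self.conv1(x)))


model = Net().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# These are collective calls: every one of the 32 ranks must reach them,
# otherwise rank 0 reports the "missing ranks" broadcast error.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
```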
Hi @waynezhang2018, can you upload a gist of the code you're running? It looks like rank 0 is broadcasting, but the other ranks are not awaiting the broadcast.
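One quick way to confirm that (a rough sketch, assuming the standard horovod.torch API) is to log each rank and run a tiny allreduce right before broadcast_parameters, so you can see which of the 32 processes actually reach the collective:

```python
# Hypothetical debugging snippet: call this right before hvd.broadcast_parameters
# (after hvd.init()) to see which ranks actually reach the broadcast.
import torch
import horovod.torch as hvd


def check_ranks_before_broadcast():
    print('rank %d (local rank %d) reached the broadcast point'
          % (hvd.rank(), hvd.local_rank()), flush=True)
    # A tiny allreduce acts as a barrier: ranks that hang here have
    # diverged (or exited) somewhere before the broadcast.
    hvd.allreduce(torch.zeros(1), name='pre_broadcast_barrier')
```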
Sorry for the late update.
I resolved the problem above by upgrading the Horovod version.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.