Hi all,
I am trying to use Horovod for multi-node, multi-GPU training, but I ran into a problem.
Environment:
Ubuntu 16.04 with Docker, PyTorch 0.4.0, Horovod
mpirun with -np 32; 4 nodes, each with 8 GPUs
When broadcasting, it fails with an error message like: broadcast meets error: broadcast.*conv2.weight [missing ranks 1, 2, 3]
At the same time, a small amount of GPU memory is occupied on nodes 1, 2, and 3, while node 0 is able to start training.
Why is this happening? I followed the example on the official website, and nothing looks wrong.
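For reference, this is roughly the launch command and per-rank setup I use, following the official Horovod PyTorch example (a simplified sketch; the model, script name, and hostnames here are placeholders, not my actual code):

```python
# Simplified sketch of what runs on every rank, modeled on the official
# Horovod PyTorch example. Net, train.py, and hostnames are placeholders.
#
# Launch (4 nodes x 8 GPUs):
#   mpirun -np 32 -H node0:8,node1:8,node2:8,node3:8 python train.py

import torch
import torch.nn.functional as F
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # pin each process to one GPU


class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = torch.nn.Conv2d(3, 16, 3)
        self.conv2 = torch.nn.Conv2d(16, 32, 3)

    def forward(self, x):
        return self.conv2(F.relu(self.conv1(x)))


model = Net().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# These are collective calls: every one of the 32 ranks must reach them,
# otherwise rank 0 reports the "missing ranks" broadcast error.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
```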
Hi @waynezhang2018, can you upload a gist of the code you're running? It looks like rank 0 is broadcasting, but the other ranks are not awaiting the broadcast.
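One quick way to confirm that (a rough sketch, assuming the standard horovod.torch API) is to log each rank and run a tiny allreduce right before broadcast_parameters, so you can see which of the 32 processes actually reach the collective:

```python
# Hypothetical debugging snippet: call this right before hvd.broadcast_parameters
# (after hvd.init()) to see which ranks actually reach the broadcast.
import torch
import horovod.torch as hvd


def check_ranks_before_broadcast():
    print('rank %d (local rank %d) reached the broadcast point'
          % (hvd.rank(), hvd.local_rank()), flush=True)
    # A tiny allreduce acts as a barrier: ranks that hang here have
    # diverged (or exited) somewhere before the broadcast.
    hvd.allreduce(torch.zeros(1), name='pre_broadcast_barrier')
```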
Sorry for the late update.
I resolved the problem above by upgrading the Horovod version.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.