Horovod: ERROR: hvd broadcast meets error: broadcast.*****conv2.weight [missing ranks 1, 2, 3]

Created on 4 Sep 2018 · 3 Comments · Source: horovod/horovod

Hi all,

I am trying to use hvd (Horovod) for multi-node, multi-GPU training, but I ran into a problem.

Environment:
Linux 16.04 with Docker, PyTorch 0.4.0, Horovod
mpirun with np=32; I have 4 nodes, each with 8 GPUs

When broadcasting, it shows this error message: broadcast meets error: broadcast.*conv2.weight [missing ranks 1, 2, 3]

At the same time, a small amount of GPU memory is occupied on nodes 1, 2, and 3, and node 0 is able to start training.

Why does this happen? I followed the example on the official website, and nothing looks wrong.

question wontfix

All 3 comments

Hi @waynezhang2018, can you upload a gist of the code you're running? It looks like rank 0 is broadcasting, but the other ranks are not awaiting the broadcast.
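To illustrate the hint above, here is a minimal toy sketch in plain Python of why a broadcast reports missing ranks. It is not Horovod's implementation: `ToyCoordinator`, `run`, the tensor name, and the world size of 4 are hypothetical and chosen only for illustration. The idea is that a broadcast is a collective operation, so it can complete only once every rank has submitted the same named operation. In Horovod's PyTorch API, the corresponding call is `hvd.broadcast_parameters`, which every process is expected to call, not just rank 0.

```python
from collections import defaultdict


class ToyCoordinator:
    """Toy stand-in for a collective-op coordinator (NOT Horovod's real code)."""

    def __init__(self, world_size):
        self.world_size = world_size
        # tensor name -> set of ranks that have submitted the op so far
        self.submitted = defaultdict(set)

    def submit(self, rank, tensor_name):
        """Record that `rank` has reached the collective for `tensor_name`."""
        self.submitted[tensor_name].add(rank)

    def missing_ranks(self, tensor_name):
        """Ranks that have not yet joined the collective for `tensor_name`."""
        everyone = set(range(self.world_size))
        return sorted(everyone - self.submitted[tensor_name])


def run(world_size, broadcasting_ranks, tensor_name="broadcast.conv2.weight"):
    """Simulate which ranks reach the broadcast and report the stragglers."""
    coordinator = ToyCoordinator(world_size)
    for rank in broadcasting_ranks:
        coordinator.submit(rank, tensor_name)
    return coordinator.missing_ranks(tensor_name)


# Buggy pattern: only rank 0 reaches the broadcast call.
stalled = run(world_size=4, broadcasting_ranks=[0])

# Expected pattern: every rank reaches the same broadcast call.
healthy = run(world_size=4, broadcasting_ranks=range(4))

print("missing when only rank 0 broadcasts:", stalled)
print("missing when all ranks broadcast:", healthy)
```

If the training script guards the broadcast with something like a rank check, or if the other processes fail or block before reaching it, the real job would be in the first situation: rank 0 waits while the remaining ranks never join.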

Sorry for the late update.

I resolved the problem above by upgrading the Horovod version.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
