Hi,
I often train networks on the cloud and need to copy data from local storage first. Is there an op in Horovod like torch.distributed.barrier, or can I use torch.distributed.barrier itself, to make all processes wait until the data has been copied completely? Thanks very much for any suggestions.
In my own use case, I've used mpi4py's comm.Barrier(), which is interoperable with Horovod.
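A minimal sketch of that idea, assuming Horovod was launched over MPI and mpi4py is installed (so both attach to the same MPI job); `copy_data_from_local()` is a hypothetical helper standing in for your data-staging step:

```python
import horovod.torch as hvd
from mpi4py import MPI

hvd.init()
comm = MPI.COMM_WORLD  # mpi4py sees the same MPI world that Horovod runs in

if hvd.rank() == 0:
    copy_data_from_local()  # hypothetical: stage the dataset onto the node

comm.Barrier()  # every rank blocks here until rank 0 finishes the copy
```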
Hey @pkuCactus, I would second @andfoy's suggestion to use mpi4py if you're running with MPI. There are also some other good suggestions in #159. Now that we also support Gloo as an alternative to MPI, though, we may want to consider adding something like this to the Horovod API in the future.
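In the meantime, a common workaround (not an official Horovod API) is to use a tiny `hvd.allreduce` as a barrier, since the collective only completes once every rank has reached it; this sketch should work with either the MPI or Gloo backend, and again `copy_data_from_local()` is a hypothetical placeholder:

```python
import torch
import horovod.torch as hvd

hvd.init()

def barrier():
    # Allreduce of a dummy tensor; returns only after all ranks participate.
    hvd.allreduce(torch.tensor(0.0), name="barrier")

if hvd.rank() == 0:
    copy_data_from_local()  # hypothetical data-staging step on one rank

barrier()  # all ranks wait here until rank 0's copy is done
```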
Thanks for the suggestions @andfoy @tgaddair, I'll try it.