Hi,
I often train networks on the cloud and need to copy data from local storage first. Is there an op in Horovod like torch.distributed.barrier, or can I use torch.distributed.barrier itself, to make all processes wait until the data has been copied completely? Thanks very much for any suggestions.
In my own use case, I've used mpi4py's comm.Barrier(), which is interoperable with Horovod.
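A minimal sketch of that idea, assuming Horovod was launched over MPI and mpi4py is installed (so both attach to the same MPI job); `copy_data_from_local()` is a hypothetical helper standing in for your data-staging step:

```python
import horovod.torch as hvd
from mpi4py import MPI

hvd.init()
comm = MPI.COMM_WORLD  # mpi4py sees the same MPI world that Horovod runs in

if hvd.rank() == 0:
    copy_data_from_local()  # hypothetical: stage the dataset onto the node

comm.Barrier()  # every rank blocks here until rank 0 finishes the copy
```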
Hey @pkuCactus, I would second @andfoy's suggestion to use mpi4py if you're running with MPI. There are also some other good suggestions in #159. Now that we also support Gloo as an alternative to MPI, though, we may want to consider adding something like this to the Horovod API in the future.
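In the meantime, a common workaround (not an official Horovod API) is to use a tiny `hvd.allreduce` as a barrier, since the collective only completes once every rank has reached it; this sketch should work with either the MPI or Gloo backend, and again `copy_data_from_local()` is a hypothetical placeholder:

```python
import torch
import horovod.torch as hvd

hvd.init()

def barrier():
    # Allreduce of a dummy tensor; returns only after all ranks participate.
    hvd.allreduce(torch.tensor(0.0), name="barrier")

if hvd.rank() == 0:
    copy_data_from_local()  # hypothetical data-staging step on one rank

barrier()  # all ranks wait here until rank 0's copy is done
```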
Thanks for the suggestions @andfoy @tgaddair, I'll try it.