It would be nice to have support for various cloud clusters (e.g. we're using mostly Azure VMs and VM scale sets for training).
Spinning up a VM cluster and starting SLURM from scratch each time sounds suboptimal.
We could maybe assume that a cluster has already been set up and that the MASTER_ADDR and MASTER_PORT environment variables have been set for torch.distributed.init_process_group, since provisioning would probably always require very cloud-specific scripting anyway.
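To illustrate what I mean, a minimal sketch of the assumption, where the cloud-specific scripts are expected to have exported the rendezvous variables before launch (`rendezvous_env` is just a hypothetical helper name; the fallbacks are made-up single-node defaults):

```python
import os

def rendezvous_env():
    # Assumed to be exported by cloud-specific launch scripts;
    # defaults below are placeholders for single-node testing.
    addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
    port = int(os.environ.get("MASTER_PORT", "29500"))
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    return addr, port, rank, world_size

addr, port, rank, world_size = rendezvous_env()
# With the env vars in place, init_method="env://" picks up
# MASTER_ADDR/MASTER_PORT directly:
# torch.distributed.init_process_group(
#     backend="nccl", init_method="env://",
#     rank=rank, world_size=world_size)
```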
I could maybe take a stab at this if pointed in the right direction, although I'm definitely not a backend/ML infra engineer or anything...
... or maybe it _is_ easiest to just set up SLURM... opinions, anyone?
yes! currently working on that :)
ETA should be a few weeks. we'll first support AWS then Google Cloud!