Pytorch-lightning: Any plans for going beyond SLURM?

Created on 20 Dec 2019  ·  1 Comment  ·  Source: PyTorchLightning/pytorch-lightning

🚀 Feature

It would be nice to have support for various cloud clusters (e.g. we're using mostly Azure VMs and VM scale sets for training).

Motivation

Spinning up a VM cluster and starting SLURM each time sounds suboptimal.

We could maybe assume that the cluster has already been set up and that the MASTER_ADDR and MASTER_PORT env variables have been set for torch.distributed.init_process_group, since the cluster setup itself would probably always require very cloud-specific scripting.
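
To make that assumption concrete, here is a rough sketch of what the training side could look like if the cloud-specific scripts only export the rendezvous variables. The helper names `read_cluster_env` and `init_distributed` are made up for illustration and are not part of pytorch-lightning. I'm also assuming RANK and WORLD_SIZE get exported next to MASTER_ADDR and MASTER_PORT, since `init_method="env://"` needs all four unless rank and world size are passed in explicitly.

```python
import os

# Variables used by torch.distributed's "env://" rendezvous.
# Assumption: the cloud-specific launch script exports all four on every node.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def read_cluster_env(env=None):
    """Hypothetical helper: collect and validate the rendezvous settings."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "rank": int(env["RANK"]),
        "world_size": int(env["WORLD_SIZE"]),
    }


def init_distributed(backend="gloo"):
    """Hypothetical helper: join the process group on an already running cluster."""
    # Imported lazily so the env check above works without torch installed.
    import torch.distributed as dist

    cfg = read_cluster_env()
    dist.init_process_group(
        backend=backend,
        init_method="env://",  # reads MASTER_ADDR / MASTER_PORT from the environment
        rank=cfg["rank"],
        world_size=cfg["world_size"],
    )
    return cfg
```

Each node would call `init_distributed()` once before building the model. The backend is left as a parameter because "gloo" works on CPU-only machines while "nccl" is the usual choice for GPU VMs.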

I could maybe try to take a stab at this if pointed in the right direction, although I'm definitely not a backend/ML infra engineer or anything...

... or maybe it _is_ easiest to just set up SLURM... opinions, anyone?

enhancement help wanted

Most helpful comment

yes! currently working on that :)
ETA should be a few weeks. we'll first support AWS then Google Cloud!
