It would be nice to have support for various cloud clusters (e.g. we're using mostly Azure VMs and VM scale sets for training).
Spinning up a VM cluster and starting SLURM from scratch each time sounds suboptimal.
We could maybe assume that a cluster has already been set up and that the MASTER_ADDR and MASTER_PORT environment variables have been set for torch.distributed.init_process_group, since provisioning would probably always require very cloud-specific scripting anyway.
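To illustrate what I mean, a minimal sketch of the assumption, where the cloud-specific scripts are expected to have exported the rendezvous variables before launch (`rendezvous_env` is just a hypothetical helper name; the fallbacks are made-up single-node defaults):

```python
import os

def rendezvous_env():
    # Assumed to be exported by cloud-specific launch scripts;
    # defaults below are placeholders for single-node testing.
    addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
    port = int(os.environ.get("MASTER_PORT", "29500"))
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    return addr, port, rank, world_size

addr, port, rank, world_size = rendezvous_env()
# With the env vars in place, init_method="env://" picks up
# MASTER_ADDR/MASTER_PORT directly:
# torch.distributed.init_process_group(
#     backend="nccl", init_method="env://",
#     rank=rank, world_size=world_size)
```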
I could maybe take a stab at this if pointed in the right direction, although I'm definitely not a backend/ML infra engineer or anything...
... or maybe it _is_ easiest to just set up SLURM... opinions, anyone?
yes! currently working on that :)
ETA should be a few weeks. we'll first support AWS then Google Cloud!