Pytorch-lightning: Best way to integrate ray as Distributed Backend

Created on 1 Apr 2020 · 12 comments · Source: PyTorchLightning/pytorch-lightning

I want to use ray as a distributed backend, mainly to make use of their autoscaling capabilities. Since I'd like to upstream my solution afterwards, and also get some feedback, I'd like to check with y'all first.

After reading the docs on both sides and the current implementation of LightningDistributedDataParallel, I think there are three main options:

  1. Follow this guide and implement a tweaked LightningDistributedDataParallel, letting pytorch-lightning handle the training
  2. Use the LightningModule's functions to instantiate a TorchTrainer instance and let ray handle the training loops.
  3. Merge pytorch_lightning.Trainer and TorchTrainer?

Currently I'm leaning towards option 1. Options 2 or 3 would give us ray.tune's hyperparameter search for free, but they seem like quite a lot of work.

question

All 12 comments

Hi! thanks for your contribution!, great first issue!

Alrighty, after a bunch more reading it turned out that tweaking pytorch_lightning.Trainer and creating a ray wrapper was easier than expected. You can see a basic API here; it doesn't yet fit the current pytorch_lightning coding style, though (I'm more used to component-based development rather than mixins).

The changes can be seen here. Basically, I added a check for SLURM wherever the code was reading a SLURM-based environment variable, then created a two-stage wrapper: a "RemoteRayTrainer", which runs on the remote nodes and sets a "LIGHTNING_X" equivalent wherever a "SLURM_X" variable was used before, and a "RayTrainer", which spawns the remote workers. I haven't tested this on an actual cluster so far, though; I still need to check that it plays nicely when actually going over the network and not only over localhost.
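To make the split between the two wrappers concrete, here's a rough, hypothetical sketch of the idea (the class names, environment variables and Trainer arguments are illustrative placeholders, not the code from my branch):

import os
import ray

ray.init()

# Hypothetical remote-side wrapper: each actor exports the variables that
# Lightning would otherwise read from SLURM, then runs the usual fit loop.
@ray.remote(num_gpus=1)
class RemoteRayTrainer:
    def __init__(self, node_rank, num_nodes, master_addr, master_port):
        os.environ["MASTER_ADDR"] = master_addr
        os.environ["MASTER_PORT"] = str(master_port)
        os.environ["NODE_RANK"] = str(node_rank)
        self.num_nodes = num_nodes

    def fit(self, model_factory):
        import pytorch_lightning as pl
        model = model_factory()
        trainer = pl.Trainer(gpus=1, num_nodes=self.num_nodes,
                             distributed_backend="ddp")
        trainer.fit(model)

# Hypothetical driver-side wrapper that spawns one remote worker per node.
class RayTrainer:
    def __init__(self, num_nodes, master_addr, master_port=29500):
        self.workers = [
            RemoteRayTrainer.remote(rank, num_nodes, master_addr, master_port)
            for rank in range(num_nodes)
        ]

    def fit(self, model_factory):
        ray.get([w.fit.remote(model_factory) for w in self.workers])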

One big-ish problem I see is that the ray.remote decorator doesn't like being applied dynamically, which means I couldn't automatically set the num_gpus requirement.

This is enough for my own purposes if it works, but as I said I'd love to upstream this if desired. Feel free to request changes at will :-)

Hey, this is great! I'd be happy to move this upstream into Ray too, if we agree on a design.

You can use .options() to dynamically set resources for a Ray remote worker:

import ray
ray.init()

@ray.remote(num_gpus=1, max_calls=1, num_return_vals=2)
def f():
    return 1, 2

# Override the decorator's resource requirements at call time.
g = f.options(num_gpus=2, max_calls=None)
g.remote() # uses 2 gpus

Reference: https://ray.readthedocs.io/en/latest/package-ref.html#ray.remote
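For what it's worth, .options() also works on actor classes, which is probably the shape the trainer workers will take; a minimal hedged sketch (the worker class below is just a placeholder):

import ray

ray.init()

# Placeholder actor standing in for a per-node training worker.
@ray.remote
class TrainerWorker:
    def ready(self):
        return True

# Decide the GPU requirement at runtime instead of hard-coding it in the decorator.
gpus_per_worker = 2
worker = TrainerWorker.options(num_gpus=gpus_per_worker).remote()
print(ray.get(worker.ready.remote()))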

Happy to chat more offline!

this is awesome! would love upstream support for our lightning users!

Ray announced RaySGD
https://medium.com/distributed-computing-with-ray/faster-and-cheaper-pytorch-with-raysgd-a5a44d4fd220
It would be great to have the option to train on preemptible instances.

A small update on this: my forked repo now contains a version of the RayTrainer that can connect directly to a ray cluster and use both ddp and ddp2 to train a model. I still have to see how it interacts with callbacks. Thanks to @richardliaw for the pointer to ".options".
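For anyone following along, switching between localhost and a real cluster only changes the ray.init call; roughly (the addresses are placeholders):

import ray

# Local experimentation: start a fresh Ray instance on this machine.
ray.init()

# Against a cluster started with `ray start --head` / `ray up`, connect instead:
# ray.init(address="auto")                 # pick up the running cluster on this node
# ray.init(address="<head-node-ip>:6379")  # or point at the head node explicitly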

hey @williamFalcon, @Borda, are you all still interested in this? any idea of what's needed to make this happen?

would be more than happy to help out if given some pointers!

hey! thanks for asking. Right now we have a freeze on new features until 1.0.0 (coming in the next few months).

We just want to stabilize the package before adding new things :)

We'll ping you once that's ready to discuss how to do this.

Hi @williamFalcon and co! Highly interested in this as a Ray user myself. Is there any movement in this direction assuming that version 1.0.0 is arriving soon?

we're almost done with 1.0. We can look into this afterwards!

thanks for mentioning this.

@amogkam has been working on integrating LightningModules with RaySGD: https://github.com/ray-project/ray/pull/11042. Once this is merged into Ray, it will provide a way to use Ray for training with PTL. If you need hparam optimization, take a look at Using PyTorch Lightning with Tune.
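For a rough idea of what the Tune route looks like, here is a hedged sketch (MyLightningModule and the metric names are placeholders; check the linked guide for the exact API of your Ray/PTL versions):

import pytorch_lightning as pl
from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCallback

def train_fn(config):
    # MyLightningModule is a placeholder for your own LightningModule.
    model = MyLightningModule(lr=config["lr"])
    trainer = pl.Trainer(
        max_epochs=10,
        gpus=1,
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
    )
    trainer.fit(model)

analysis = tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    num_samples=10,
    resources_per_trial={"cpu": 2, "gpu": 1},
    metric="loss",
    mode="min",
)
print(analysis.best_config)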

Hi @williamFalcon, any update on this? We'd be more than happy to explore an integration!
