Pytorch-lightning: Best way to integrate ray as Distributed Backend

Created on 1 Apr 2020 · 12 comments · Source: PyTorchLightning/pytorch-lightning

I want to use ray as a distributed backend, mainly to make use of their autoscaling capabilities. Since I'd like to upstream my solution afterwards, and also get some feedback, I'd like to check with y'all first.

After reading the docs on both sides and the current implementation of LightningDistributedDataParallel, I think there are three main options:

  1. Follow this guide and implement a tweaked LightningDistributedDataParallel, letting pytorch-lightning handle the training
  2. Use the LightningModule's functions to instantiate a TorchTrainer instance and let ray handle the training loops.
  3. Merge pytorch_lightning.Trainer and TorchTrainer?

Currently I'm leaning towards option 1. Options 2 or 3 would give us ray.tune's hyperparameter search for free, but they seem like quite a lot of work.

question

All 12 comments

Hi! thanks for your contribution!, great first issue!

Alrighty, after a bunch more reading it turned out that tweaking pytorch_lightning.Trainer and creating a ray wrapper was easier than expected. You can see a basic API here; it doesn't yet fit the current pytorch_lightning coding style, though (I'm more used to component-based development rather than mixins).

The changes can be seen here. Basically, I added a check for SLURM wherever the code was reading a SLURM-based environment variable, then created a two-stage wrapper: a "RemoteRayTrainer", which runs on the remote nodes and sets a "LIGHTNING_X" equivalent wherever a "SLURM_X" variable was used before, and a "RayTrainer", which spawns the remote workers. I haven't tested this on an actual cluster so far, though; I still need to check that it plays nicely when actually going over the network and not only over localhost.
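To make the split between the two wrappers concrete, here's a rough, hypothetical sketch of the idea (the class names, environment variables and Trainer arguments are illustrative placeholders, not the code from my branch):

import os
import ray

ray.init()

# Hypothetical remote-side wrapper: each actor exports the variables that
# Lightning would otherwise read from SLURM, then runs the usual fit loop.
@ray.remote(num_gpus=1)
class RemoteRayTrainer:
    def __init__(self, node_rank, num_nodes, master_addr, master_port):
        os.environ["MASTER_ADDR"] = master_addr
        os.environ["MASTER_PORT"] = str(master_port)
        os.environ["NODE_RANK"] = str(node_rank)
        self.num_nodes = num_nodes

    def fit(self, model_factory):
        import pytorch_lightning as pl
        model = model_factory()
        trainer = pl.Trainer(gpus=1, num_nodes=self.num_nodes,
                             distributed_backend="ddp")
        trainer.fit(model)

# Hypothetical driver-side wrapper that spawns one remote worker per node.
class RayTrainer:
    def __init__(self, num_nodes, master_addr, master_port=29500):
        self.workers = [
            RemoteRayTrainer.remote(rank, num_nodes, master_addr, master_port)
            for rank in range(num_nodes)
        ]

    def fit(self, model_factory):
        ray.get([w.fit.remote(model_factory) for w in self.workers])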

One big-ish problem I see is that the ray.remote decorator doesn't like being applied dynamically, which means I couldn't automatically set the num_gpus requirement.

This is enough for my own purposes if it works, but as I said I'd love to upstream this if desired. Feel free to request changes at will :-)

Hey, this is great! I'd be happy to move this upstream into Ray too, if we agree on a design.

You can use .options() to dynamically set resources for a Ray remote worker:

import ray
ray.init()

@ray.remote(num_gpus=1, max_calls=1, num_return_vals=2)
def f():
    return 1, 2

# Override the decorator's resource requirements at call time.
g = f.options(num_gpus=2, max_calls=None)
g.remote() # uses 2 gpus

Reference: https://ray.readthedocs.io/en/latest/package-ref.html#ray.remote
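For what it's worth, .options() also works on actor classes, which is probably the shape the trainer workers will take; a minimal hedged sketch (the worker class below is just a placeholder):

import ray

ray.init()

# Placeholder actor standing in for a per-node training worker.
@ray.remote
class TrainerWorker:
    def ready(self):
        return True

# Decide the GPU requirement at runtime instead of hard-coding it in the decorator.
gpus_per_worker = 2
worker = TrainerWorker.options(num_gpus=gpus_per_worker).remote()
print(ray.get(worker.ready.remote()))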

Happy to chat more offline!

this is awesome! would love upstream support for our lightning users!

Ray announced RaySGD
https://medium.com/distributed-computing-with-ray/faster-and-cheaper-pytorch-with-raysgd-a5a44d4fd220
It would be great to have the option to train on preemptible instances.

A small update on this: my forked repo now contains a version of the RayTrainer that can connect directly to a ray cluster and use both ddp and ddp2 to train a model. I still have to see how it interacts with callbacks. Thanks to @richardliaw for the pointer to ".options".
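For anyone following along, switching between localhost and a real cluster only changes the ray.init call; roughly (the addresses are placeholders):

import ray

# Local experimentation: start a fresh Ray instance on this machine.
ray.init()

# Against a cluster started with `ray start --head` / `ray up`, connect instead:
# ray.init(address="auto")                 # pick up the running cluster on this node
# ray.init(address="<head-node-ip>:6379")  # or point at the head node explicitly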

hey @williamFalcon, @Borda, are you all still interested in this? any idea of what's needed to make this happen?

would be more than happy to help out if given some pointers!

hey! thanks for asking. Right now we have a freeze on new features until 1.0.0 (coming in the next few months).

We just want to stabilize the package before adding new things :)

We'll ping you once that's ready to discuss how to do this.

Hi @williamFalcon and co! Highly interested in this as a Ray user myself. Is there any movement in this direction assuming that version 1.0.0 is arriving soon?

we're almost done with 1.0. We can look into this afterwards!

thanks for mentioning this.

@amogkam has been working on integrating LightningModules with RaySGD: https://github.com/ray-project/ray/pull/11042. Once this is merged into Ray, it will provide a way to use Ray for training with PTL. If you need hparam optimization, take a look at Using PyTorch Lightning with Tune.
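For a rough idea of what the Tune route looks like, here is a hedged sketch (MyLightningModule and the metric names are placeholders; check the linked guide for the exact API of your Ray/PTL versions):

import pytorch_lightning as pl
from ray import tune
from ray.tune.integration.pytorch_lightning import TuneReportCallback

def train_fn(config):
    # MyLightningModule is a placeholder for your own LightningModule.
    model = MyLightningModule(lr=config["lr"])
    trainer = pl.Trainer(
        max_epochs=10,
        gpus=1,
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
    )
    trainer.fit(model)

analysis = tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    num_samples=10,
    resources_per_trial={"cpu": 2, "gpu": 1},
    metric="loss",
    mode="min",
)
print(analysis.best_config)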

Hi @williamFalcon, any update on this? We'd be more than happy to explore an integration!
