Let's support this!
Forgive me if I'm wrong, but doesn't Lightning already provide many functions supported by DeepSpeed? Also, going by a cursory reading of DeepSpeed, isn't it just another wrapper for Pytorch? Or am I wrong?
I haven't had a chance to read it in depth, but this is likely a library that operates on top of models, which means Lightning can use it.
Based on the CIFAR example, I think it's something like Lightning with more features. One option is to make Lightning depend on DeepSpeed for the training-related pieces while Lightning focuses on reproducibility.
It's not like lightning at all lol. It's more like apex... or ddp.
To add support in Lightning we need to create a flag and follow the README instructions:
https://github.com/microsoft/DeepSpeed
Create a deepspeed object for the configs.

When the flag is enabled,

    Trainer(distributed_backend='deepspeed')

OR

    Trainer(backend_engine='deepspeed')
The trainer does the following:
    model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
                                                          model=model,
                                                          model_parameters=params)
Note: we need to forward calls to training_step, validation_step and test_step accordingly (see the DDP override).
    for step, batch in enumerate(data_loader):
        # forward() method
        loss = model_engine(batch)

        # runs backpropagation
        model_engine.backward(loss)

        # weight update
        model_engine.step()

    model_engine.save_checkpoint(args.save_dir, ckpt_id, client_sd=client_sd)
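To make the note above about forwarding to training_step / validation_step / test_step concrete, here is a rough sketch of the kind of wrapper the trainer could hand to deepspeed.initialize, modelled on the existing DDP override mentioned above. The class name and the dispatch logic are assumptions for illustration, not an existing Lightning API:

    import torch.nn as nn

    class LightningDeepSpeedModule(nn.Module):
        # Hypothetical wrapper: deepspeed.initialize would receive this instead of the
        # raw LightningModule, so model_engine(batch) dispatches to the right *_step hook.
        def __init__(self, pl_module):
            super().__init__()
            self.module = pl_module

        def forward(self, *args, **kwargs):
            if self.module.training:
                return self.module.training_step(*args, **kwargs)
            # a full integration would also branch to test_step, mirroring the DDP override
            return self.module.validation_step(*args, **kwargs)

The loop above would then stay unchanged, with the loss coming out of model_engine(batch).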
We need to make sure that when DeepSpeed is enabled we defer to the library so it can handle 16-bit precision and DDP.
Since the trainer flags have most of what's needed, we can automatically set up the config for the user (https://github.com/microsoft/DeepSpeed#deepspeed-configuration).
    {
      "train_batch_size": 8,
      "gradient_accumulation_steps": 1,
      "steps_per_print": 1,
      "zero_optimization": true,
      "disable_allgather": true,
      "optimizer": {
        "type": "Adam",
        "params": {
          "lr": 0.00015,
          "max_grad_norm": 1.0
        }
      },
      "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
      }
    }
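As a sketch of what that auto-setup could look like (the helper name and the exact flag-to-field mapping are assumptions, not an implemented API), the trainer could translate its existing flags into the dict above:

    def build_deepspeed_config(trainer, batch_size, lr):
        # Hypothetical helper: derive the DeepSpeed config above from existing Trainer
        # flags. batch_size and lr are passed in explicitly since the Trainer doesn't
        # own them directly.
        return {
            "train_batch_size": batch_size,
            "gradient_accumulation_steps": trainer.accumulate_grad_batches,
            "zero_optimization": True,
            "optimizer": {
                "type": "Adam",
                "params": {"lr": lr, "max_grad_norm": trainer.gradient_clip_val or 1.0},
            },
            "fp16": {"enabled": trainer.precision == 16},
        }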
@neggert @jeffling anyone want to take this?
@luiscape might be a good issue to try?
@Borda also a good issue to start with
@jeffra, @ShadenSmith, @samyam, anyone interested in adding this to lightning? 😄
Awesome job!
@williamFalcon okay my bad, I didn't read through it properly. Btw, I think the Training Optimizers, Advanced Parameter Search and Simplified Data Loader seem like good features to include in Lightning if the DeepSpeed backend is used. Or is it better for the user to call them manually using the DeepSpeed library?
@xingzhaolee these are all features we should automatically enable when someone uses the deepspeed backend.
We should also make that configurable so users can modify it if they want to:
    def configure_deepspeed(self, ...):
        # do auto setup stuff for users
Then if you want a different way of doing this, override this function and add your own version.
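For example, a user who wants a different setup could override the proposed hook in their LightningModule (configure_deepspeed is only a suggested name at this point, and the returned dict is illustrative):

    from pytorch_lightning import LightningModule

    class MyModel(LightningModule):
        def configure_deepspeed(self):
            # hand back a custom DeepSpeed config instead of the auto-generated one
            return {
                "train_batch_size": 32,
                "zero_optimization": True,
                "fp16": {"enabled": True},
            }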
> @jeffra, @ShadenSmith, @samyam, anyone interested in adding this to lightning? 😄
> Awesome job!
@williamFalcon Thanks for reaching out to us, this could be great. We are having internal discussions about how to proceed and will get back to you soon. We're also in the process of learning more about Lightning, it looks like great work you all have done :)
Hi team, this would be a huge addition to lightning!
The biggest barrier to entry for deepspeed is having to refactor all of my code, and the biggest attractor to lightning is the organization and simplicity it provides
what's the latest on this? will deepspeed be integrated in lightning at some point soon? thank you :)
yup! we're actively working on this. Expect a v1 of it in the next few weeks via an rc. (cc @SeanNaren )
DeepSpeed is made up of many components as @williamFalcon said above, but I think we're only after a few pieces here. ZERO optimization is the key piece, and the code can be seen here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage2.py
Stage 1 and 2 have been released for ZERO optimization, and for reference (from the paper):

1) Optimizer State Partitioning (Pos): 4x memory reduction, same communication volume as DP;
2) Add Gradient Partitioning (Pos+g): 8x memory reduction, same communication volume as DP;
3) Add Parameter Partitioning (Pos+g+p): Memory reduction is linear with DP degree Nd. For example, splitting across 64 GPUs (Nd = 64) will yield a 64x memory reduction. There is a modest 50% increase in communication volume.
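As a quick sanity check of those factors, here is the arithmetic using the paper's mixed-precision Adam accounting (2 bytes/param for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state); the function below is just illustrative, not DeepSpeed code:

    def zero_memory_per_gpu(num_params, nd, stage):
        # bytes per GPU under the ZeRO paper's mixed-precision Adam accounting
        params, grads, opt = 2 * num_params, 2 * num_params, 12 * num_params
        if stage >= 1:  # Pos: shard optimizer state across Nd ranks
            opt /= nd
        if stage >= 2:  # Pos+g: also shard gradients
            grads /= nd
        if stage >= 3:  # Pos+g+p: also shard parameters
            params /= nd
        return params + grads + opt

    # the paper's 7.5B-parameter example on 64 GPUs:
    n, nd = 7.5e9, 64
    print(zero_memory_per_gpu(n, nd, 0) / 1e9)  # ~120 GB baseline
    print(zero_memory_per_gpu(n, nd, 1) / 1e9)  # ~31 GB  (~4x)
    print(zero_memory_per_gpu(n, nd, 2) / 1e9)  # ~17 GB  (~8x asymptotically)
    print(zero_memory_per_gpu(n, nd, 3) / 1e9)  # ~1.9 GB (~64x)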
It's called an optimizer but it's a bit more involved. The goal API is to create an accelerator that encompasses this functionality, something like below:
    model = MegatronLM()  # too big for single-GPU training

    trainer = Trainer(
        accelerator='ddp',
        num_gpus=2
    )
    trainer.fit(model)  # crashes because of CUDA out of memory

    trainer = Trainer(
        accelerator='deepspeed',
        num_gpus=2
    )
    trainer.fit(model)  # actually trains!
I'm currently stepping through the optimizer and separating it from the fp16 components to play nice with native amp. If anyone is interested message me or comment! Happy to collab :)
sounds great! looking forward to the v1 ;)
FYI we have an implementation of the optimizer side in https://github.com/facebookresearch/fairscale/blob/master/fairscale/optim/oss.py, compatible with standard pytorch (ie. same param groups for instance, so that schedulers don't see a change). The issue with any implementation is that to get the full benefits you need to change the way the DP engine works though, that's true of 1) and 2) above. If you keep the normal pytorch DDP then the gradients are all-reduced and you waste some traffic. cc @ananthsub
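For anyone who wants to try it outside of Lightning first, the drop-in usage is roughly the following (based on the fairscale README; treat the exact keyword arguments as approximate):

    import torch
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP
    from fairscale.optim.oss import OSS

    # assumes torch.distributed has already been initialized by the launcher
    model = DDP(nn.Linear(128, 10).cuda())

    # OSS wraps a regular torch optimizer and shards its state across ranks
    optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-4)

    loss = model(torch.randn(8, 128).cuda()).sum()
    loss.backward()   # plain DDP all-reduce, unchanged
    optimizer.step()  # each rank updates its shard, updated params are re-broadcast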
thanks @blefaudeux! That's a clean implementation :) have you seen a large performance degradation using the fairscale + ddp implementation?
we could drop the v1 to use the non optimized version first? then quickly move to a v2 where we modify the ddp stuff as well?
we already have lightningddp which modifies the original ddp a bit.
> thanks @blefaudeux! That's a clean implementation :) have you seen a large performance degradation using the fairscale + ddp implementation?
(With the standard DDP, using the linked optimizer as a drop-in "replacement", more of a wrap, for a normal optimizer.) A couple of percent if multi-node, but that would depend on the interconnect. Intra-node it's actually faster, on top of saving memory.
Now with a more custom DDP, like what deepspeed is doing, there's a lot of potential in terms of speed and memory, but it's a little more complicated to integrate; working on it. I can mention https://github.com/pytorch/pytorch/issues/42849 and https://github.com/pytorch/pytorch/issues/37002 here; ideally it would be nice to be able to customize the communication patterns without duplicating/forking.
I assume the memory saving is pretty much the same? I think that's definitely key, so as @williamFalcon said we could start from there!
> I assume the memory saving is pretty much the same? I think that's definitely key, so as @williamFalcon said we could start from there!
You can save a bit more if you own the communications, because you can release the gradients as soon as they have been reduced to the appropriate rank; that's 2) basically. So 1) is drop-in (usable with normal DDP and you get some savings); you can get a 1.5 of sorts by releasing all the now-useless gradients at the beginning of the sharded optimizer step (that's what the fairscale implementation above does); and 2) is when you drop the gradients as soon as possible, earlier. Example: a toy problem training a RN101 on 4 GPUs, first is DDP, second is OSS+DDP, third is OSS+custom DDP (the losses should be exactly the same, fixing that).
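A minimal sketch of that "1.5 of sorts" variant, with made-up names, just to show the idea of freeing gradients a rank will never use once the all-reduce has completed:

    def sharded_optimizer_step(optimizer, owned_params, all_params):
        # Hypothetical sketch: after the usual DDP all-reduce, drop the gradients this
        # rank does not own before running its shard of the optimizer, so that memory
        # is released as early as possible.
        owned = set(owned_params)
        for p in all_params:
            if p not in owned:
                p.grad = None  # this rank never updates p, its gradient can be freed
        optimizer.step()       # only touches the locally-owned shard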
Thanks @blefaudeux! We'll get the fairscale OSS integrated into a new accelerator, then look towards DDP changes to reduce the overhead further. Out of curiosity has there been any progress integrating gradient/parameter partitioning?
> Thanks @blefaudeux!
of course !
> We'll get the fairscale OSS integrated into a new accelerator, then look towards DDP changes to reduce the overhead further.
You might need to sync with Ananth (@ananthsub); within FB there's already a lightning/fairscale integration running, could be worth unifying the efforts?
> Out of curiosity has there been any progress integrating gradient/parameter partitioning?
I've an experimental branch which gives these results currently (last chunk, 'OSS experimental'), following the ideas presented in this RFC (split the model in chunks, use autograd hooks to load/drop the parameters on the fly while keeping reasonably close to pytorch, each rank owns the optimization for one chunk only), very much WIP though. The savings depend a lot on the model size and optimizer, and with this test problem the activations dominate, so it's not the most impressive use case (still useful).
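A very rough sketch of the chunk-ownership idea described above, using module pre-forward hooks rather than the actual autograd hooks, and omitting the part where non-owned copies are freed again (names are made up for illustration):

    import torch.distributed as dist

    def partition_chunks(chunks):
        # each rank "owns" one chunk of the model; just before a chunk runs forward,
        # its parameters are broadcast from the owning rank so every rank sees fresh weights
        for owner_rank, chunk in enumerate(chunks):
            def pull_params(module, inputs, owner_rank=owner_rank):
                for p in module.parameters():
                    dist.broadcast(p.data, src=owner_rank)
            chunk.register_forward_pre_hook(pull_params)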