Let's support this!
Forgive me if I'm wrong, but doesn't Lightning already provide many functions supported by DeepSpeed? Also, going by a cursory reading of DeepSpeed, isn't it just another wrapper for Pytorch? Or am I wrong?
I haven't had a chance to read it in depth, but this is likely a library that operates on top of models, which means Lightning can use it.
Based on the CIFAR example, I think it's something like Lightning with more features. One option is to make Lightning depend on DeepSpeed for the training-related pieces while Lightning focuses on reproducibility.
It's not like lightning at all lol. It's more like apex... or ddp.
To add support in Lightning we need to create a flag and follow the README instructions:
https://github.com/microsoft/DeepSpeed
Create a deepspeed object for the configs.

When the flag is enabled,

    Trainer(distributed_backend='deepspeed')

OR

    Trainer(backend_engine='deepspeed')
The trainer does the following:
    model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
                                                          model=model,
                                                          model_parameters=params)
Note: we need to forward calls to training_step, validation_step and test_step accordingly (see the DDP override).
    for step, batch in enumerate(data_loader):
        # forward() method
        loss = model_engine(batch)

        # runs backpropagation
        model_engine.backward(loss)

        # weight update
        model_engine.step()

    model_engine.save_checkpoint(args.save_dir, ckpt_id, client_sd=client_sd)
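To make the note above about forwarding to training_step / validation_step / test_step concrete, here is a rough sketch of the kind of wrapper the trainer could hand to deepspeed.initialize, modelled on the existing DDP override mentioned above. The class name and the dispatch logic are assumptions for illustration, not an existing Lightning API:

    import torch.nn as nn

    class LightningDeepSpeedModule(nn.Module):
        # Hypothetical wrapper: deepspeed.initialize would receive this instead of the
        # raw LightningModule, so model_engine(batch) dispatches to the right *_step hook.
        def __init__(self, pl_module):
            super().__init__()
            self.module = pl_module

        def forward(self, *args, **kwargs):
            if self.module.training:
                return self.module.training_step(*args, **kwargs)
            # a full integration would also branch to test_step, mirroring the DDP override
            return self.module.validation_step(*args, **kwargs)

The loop above would then stay unchanged, with the loss coming out of model_engine(batch).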
We need to make sure that when DeepSpeed is enabled we defer to the library so it can handle 16-bit precision and DDP.
Since the trainer flags have most of what's needed, we can automatically set up the config for the user (https://github.com/microsoft/DeepSpeed#deepspeed-configuration).
    {
      "train_batch_size": 8,
      "gradient_accumulation_steps": 1,
      "steps_per_print": 1,
      "zero_optimization": true,
      "disable_allgather": true,
      "optimizer": {
        "type": "Adam",
        "params": {
          "lr": 0.00015,
          "max_grad_norm": 1.0
        }
      },
      "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
      }
    }
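As a sketch of what that auto-setup could look like (the helper name and the exact flag-to-field mapping are assumptions, not an implemented API), the trainer could translate its existing flags into the dict above:

    def build_deepspeed_config(trainer, batch_size, lr):
        # Hypothetical helper: derive the DeepSpeed config above from existing Trainer
        # flags. batch_size and lr are passed in explicitly since the Trainer doesn't
        # own them directly.
        return {
            "train_batch_size": batch_size,
            "gradient_accumulation_steps": trainer.accumulate_grad_batches,
            "zero_optimization": True,
            "optimizer": {
                "type": "Adam",
                "params": {"lr": lr, "max_grad_norm": trainer.gradient_clip_val or 1.0},
            },
            "fp16": {"enabled": trainer.precision == 16},
        }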
@neggert @jeffling anyone want to take this?
@luiscape might be a good issue to try?
@Borda also a good issue to start with
@jeffra, @ShadenSmith, @samyam, anyone interested in adding this to lightning? 😄
Awesome job!
@williamFalcon okay my bad, I didn't read through it properly. Btw, I think the Training Optimizers, Advanced Parameter Search and Simplified Data Loader seem like good features to include in Lightning if the DeepSpeed backend is used. Or is it better for the user to call them manually using the DeepSpeed library?
@xingzhaolee these are all features we should automatically enable when someone uses the deepspeed backend.
We should also make that configurable so users can modify it if they want to:
    def configure_deepspeed(self, ...):
        # do auto setup stuff for users
Then if you want a different way of doing this, override this function and add your own version.
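For example, a user who wants a different setup could override the proposed hook in their LightningModule (configure_deepspeed is only a suggested name at this point, and the returned dict is illustrative):

    from pytorch_lightning import LightningModule

    class MyModel(LightningModule):
        def configure_deepspeed(self):
            # hand back a custom DeepSpeed config instead of the auto-generated one
            return {
                "train_batch_size": 32,
                "zero_optimization": True,
                "fp16": {"enabled": True},
            }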
> @jeffra, @ShadenSmith, @samyam, anyone interested in adding this to lightning? 😄
> Awesome job!
@williamFalcon Thanks for reaching out to us, this could be great. We are having internal discussions about how to proceed and will get back to you soon. We're also in the process of learning more about Lightning, it looks like great work you all have done :)
Hi team, this would be a huge addition to lightning!
The biggest barrier to entry for deepspeed is having to refactor all of my code, and the biggest attractor to lightning is the organization and simplicity it provides
what's the latest on this? will deepspeed be integrated in lightning at some point soon? thank you :)
yup! we're actively working on this. Expect a v1 of it in the next few weeks via an rc. (cc @SeanNaren )
DeepSpeed is made up of many components as @williamFalcon said above, but I think we're only after a few pieces here. ZERO optimization is the key piece, and the code can be seen here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage2.py
Stage 1 and 2 have been released for ZERO optimization, and for reference (from the paper):

1) Optimizer State Partitioning (Pos): 4x memory reduction, same communication volume as DP;
2) Add Gradient Partitioning (Pos+g): 8x memory reduction, same communication volume as DP;
3) Add Parameter Partitioning (Pos+g+p): Memory reduction is linear with DP degree Nd. For example, splitting across 64 GPUs (Nd = 64) will yield a 64x memory reduction. There is a modest 50% increase in communication volume.
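As a quick sanity check of those factors, here is the arithmetic using the paper's mixed-precision Adam accounting (2 bytes/param for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state); the function below is just illustrative, not DeepSpeed code:

    def zero_memory_per_gpu(num_params, nd, stage):
        # bytes per GPU under the ZeRO paper's mixed-precision Adam accounting
        params, grads, opt = 2 * num_params, 2 * num_params, 12 * num_params
        if stage >= 1:  # Pos: shard optimizer state across Nd ranks
            opt /= nd
        if stage >= 2:  # Pos+g: also shard gradients
            grads /= nd
        if stage >= 3:  # Pos+g+p: also shard parameters
            params /= nd
        return params + grads + opt

    # the paper's 7.5B-parameter example on 64 GPUs:
    n, nd = 7.5e9, 64
    print(zero_memory_per_gpu(n, nd, 0) / 1e9)  # ~120 GB baseline
    print(zero_memory_per_gpu(n, nd, 1) / 1e9)  # ~31 GB  (~4x)
    print(zero_memory_per_gpu(n, nd, 2) / 1e9)  # ~17 GB  (~8x asymptotically)
    print(zero_memory_per_gpu(n, nd, 3) / 1e9)  # ~1.9 GB (~64x)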
It's called an optimizer but it's a bit more involved. The goal API is to create an accelerator that encompasses this functionality, something like below:
    model = MegatronLM()  # too big for single-GPU training

    trainer = Trainer(
        accelerator='ddp',
        num_gpus=2
    )
    trainer.fit(model)  # crashes because of CUDA out of memory

    trainer = Trainer(
        accelerator='deepspeed',
        num_gpus=2
    )
    trainer.fit(model)  # actually trains!
I'm currently stepping through the optimizer and separating it from the fp16 components to play nice with native amp. If anyone is interested message me or comment! Happy to collab :)
sounds great! looking forward to the v1 ;)
FYI we have an implementation of the optimizer side in https://github.com/facebookresearch/fairscale/blob/master/fairscale/optim/oss.py, compatible with standard pytorch (ie. same param groups for instance, so that schedulers don't see a change). The issue with any implementation is that to get the full benefits you need to change the way the DP engine works though, that's true of 1) and 2) above. If you keep the normal pytorch DDP then the gradients are all-reduced and you waste some traffic. cc @ananthsub
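For anyone who wants to try it outside of Lightning first, the drop-in usage is roughly the following (based on the fairscale README; treat the exact keyword arguments as approximate):

    import torch
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP
    from fairscale.optim.oss import OSS

    # assumes torch.distributed has already been initialized by the launcher
    model = DDP(nn.Linear(128, 10).cuda())

    # OSS wraps a regular torch optimizer and shards its state across ranks
    optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-4)

    loss = model(torch.randn(8, 128).cuda()).sum()
    loss.backward()   # plain DDP all-reduce, unchanged
    optimizer.step()  # each rank updates its shard, updated params are re-broadcast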
thanks @blefaudeux! That's a clean implementation :) have you seen a large performance degradation using the fairscale + ddp implementation?
we could drop the v1 to use the non optimized version first? then quickly move to a v2 where we modify the ddp stuff as well?
we already have lightningddp which modifies the original ddp a bit.
> thanks @blefaudeux! That's a clean implementation :) have you seen a large performance degradation using the fairscale + ddp implementation?
(With the standard DDP, using the linked optimizer as a drop-in "replacement", more of a wrap, for a normal optimizer.) A couple of percent if multi-node, but that would depend on the interconnect. Intra-node it's actually faster, on top of saving memory.
Now with a more custom DDP, like what deepspeed is doing, there's a lot of potential in terms of speed and memory, but it's a little more complicated to integrate; working on it. I can mention https://github.com/pytorch/pytorch/issues/42849 and https://github.com/pytorch/pytorch/issues/37002 here; ideally it would be nice to be able to customize the communication patterns without duplicating/forking.
I assume the memory saving is pretty much the same? I think that's definitely key, so as @williamFalcon said we could start from there!
> I assume the memory saving is pretty much the same? I think that's definitely key, so as @williamFalcon said we could start from there!
You can save a bit more if you own the communications, because you can release the gradients as soon as they have been reduced to the appropriate rank; that's 2) basically. So 1) is drop-in (usable with normal DDP and you get some savings); you can get a 1.5 of sorts by releasing all the now-useless gradients at the beginning of the sharded optimizer step (that's what the fairscale implementation above does); and 2) is when you drop the gradients as soon as possible, earlier. Example: a toy problem training a RN101 on 4 GPUs, first is DDP, second is OSS+DDP, third is OSS+custom DDP (the losses should be exactly the same, fixing that).
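A minimal sketch of that "1.5 of sorts" variant, with made-up names, just to show the idea of freeing gradients a rank will never use once the all-reduce has completed:

    def sharded_optimizer_step(optimizer, owned_params, all_params):
        # Hypothetical sketch: after the usual DDP all-reduce, drop the gradients this
        # rank does not own before running its shard of the optimizer, so that memory
        # is released as early as possible.
        owned = set(owned_params)
        for p in all_params:
            if p not in owned:
                p.grad = None  # this rank never updates p, its gradient can be freed
        optimizer.step()       # only touches the locally-owned shard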
Thanks @blefaudeux! We'll get the fairscale OSS integrated into a new accelerator, then look towards DDP changes to reduce the overhead further. Out of curiosity has there been any progress integrating gradient/parameter partitioning?
> Thanks @blefaudeux!
of course !
> We'll get the fairscale OSS integrated into a new accelerator, then look towards DDP changes to reduce the overhead further.
You might need to sync with Ananth (@ananthsub); within FB there's already a lightning/fairscale integration running, could be worth unifying the efforts?
> Out of curiosity has there been any progress integrating gradient/parameter partitioning?
I've an experimental branch which gives these results currently (last chunk, 'OSS experimental'), following the ideas presented in this RFC (split the model in chunks, use autograd hooks to load/drop the parameters on the fly while keeping reasonably close to pytorch, each rank owns the optimization for one chunk only), very much WIP though. The savings depend a lot on the model size and optimizer, and with this test problem the activations dominate, so it's not the most impressive use case (still useful).
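A very rough sketch of the chunk-ownership idea described above, using module pre-forward hooks rather than the actual autograd hooks, and omitting the part where non-owned copies are freed again (names are made up for illustration):

    import torch.distributed as dist

    def partition_chunks(chunks):
        # each rank "owns" one chunk of the model; just before a chunk runs forward,
        # its parameters are broadcast from the owning rank so every rank sees fresh weights
        for owner_rank, chunk in enumerate(chunks):
            def pull_params(module, inputs, owner_rank=owner_rank):
                for p in module.parameters():
                    dist.broadcast(p.data, src=owner_rank)
            chunk.register_forward_pre_hook(pull_params)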