Hi, recently I have been trying to standardize one of our research models, which led me to Lightning.
I have a situation involving multi-model, multi-loss training, which is described in the post below:
https://discuss.pytorch.org/t/multiple-networks-multiple-losses/91130?u=pavanmv
Please let me know if this can be achieved using Lightning.
Hi! Thanks for your contribution! Great first issue!
@MVPavan I don't see any reason why this should not work in Lightning. Lightning allows you to define multiple optimizers; all you need to do is pass the correct model parameters to the correct optimizer when defining them. You can refer to this documentation: https://pytorch-lightning.readthedocs.io/en/latest/api/pytorch_lightning.core.html#pytorch_lightning.core.LightningModule.configure_optimizers
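For illustration, a minimal sketch of what that could look like (the submodule names net1/net2 and the Adam optimizers are assumptions, not from the original post):

import torch
import pytorch_lightning as pl

class TwoNetModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # two independent networks, stand-ins for the actual models
        self.net1 = torch.nn.Linear(4, 1)
        self.net2 = torch.nn.Linear(4, 1)

    def configure_optimizers(self):
        # one optimizer per network, each given only that network's parameters
        opt1 = torch.optim.Adam(self.net1.parameters(), lr=1e-3)
        opt2 = torch.optim.Adam(self.net2.parameters(), lr=1e-3)
        return [opt1, opt2]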
@ananyahjha93 but how will loss.backward() and optimizer.step() work here? In the diagram given in the link, each loss's backward is done separately. Will it be equivalent to adding all the losses and doing backward once for the total loss?
@ananyahjha93 in my case loss2.backward() and opt2.step() have to be called only after calling loss1.backward() and opt1.step(), which updates all the grads of NN1 and flushes all the grads of NN2. Correct me if there is any gap in my understanding.
Okay, I misunderstood the gradient flow here; now I get it. @MVPavan training_step is called n times per batch, where n is the number of optimizers. So with multiple optimizers, training_step receives an optimizer_idx argument. All you need to do is set up some if/else conditions in your training_step, one branch per optimizer. Also, you need to store the outputs from the previous call (optimizer_idx=0): for loss1 you used out1 and out2, so store them and reuse them in the next call to training_step with optimizer_idx=1. After each call to training_step, loss.backward() and optimizer.step() are run for the optimizer at that optimizer_idx.
@rohitgr7 thanks for this answer!
@MVPavan
Also, since you are storing something like self.loss1 (like in the answer above) in the first call to training_step and then expecting it to be available in the next call, use DDP for multiple GPUs instead of DP. DP doesn't support state maintenance as of now.
since you are storing something like self.loss1 (like in the answer above)
you need to store the outputs from the previous step
Not the loss but outputs :)
Also, I am thinking about the inefficiency of loss1.backward() for NN2, since NN2 is not updated by optimizer1 with the gradients computed by loss1.backward(). I was thinking of a way to freeze the NN2 weights for optimizer_idx=0, calculate out2 and store it, then for optimizer_idx=1 unfreeze NN2, use the stored out2 to calculate loss2, and do loss2.backward(). Will the gradients for NN2 be created here, given that out2 was calculated while NN2 was frozen?
@rohitgr7 yeah, sorry about that. Thanks for the correction! What about
with torch.no_grad()
But still, the inefficiency remains. The efficient way, I think, would be to calculate out1 and out2 only once, storing them at optimizer_idx=0 in training_step, but have loss1.backward() not compute gradients for NN2 and loss2.backward() not compute gradients for NN1. ~Can't figure out a way to make this work.~
Edit: to avoid the gradient calculation, just use .detach(); but then why even have that term there, since the losses are added anyway.
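A small plain-PyTorch illustration of what .detach() does here (tiny hypothetical nets, not the actual models from the thread): detaching out2 cuts the graph on that branch, so loss1.backward() populates gradients for net1 only. Similarly, an output computed under torch.no_grad() carries no graph at all, so a later backward through it cannot reach that network.

import torch

net1 = torch.nn.Linear(4, 1)
net2 = torch.nn.Linear(4, 1)
x = torch.randn(8, 4)

out1 = net1(x)
out2 = net2(x)

# out2 is detached, so only net1's branch of the graph is traversed by backward
loss1 = ((out1 - out2.detach()) ** 2).mean()
loss1.backward()

print(net1.weight.grad is None)  # False: net1 received gradients
print(net2.weight.grad is None)  # True: net2 did not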
@rohitgr7 thanks for the explanation. In PyTorch, once we have all the outputs for one iteration, calling loss1.backward() calculates the grads for both NN1 and NN2; opt1.step() will then update only the weights of NN1, as it is defined for just NN1, and then zeros all the grads of NN1 and NN2.
Now loss2.backward() and opt2.step() will do the same for NN2. This way we can achieve both in a single training step. Correct me if I am missing something.
Now I wanted to understand whether this can be achieved using Lightning.
Thanks in advance!
then zeros all the grads of NN1 and NN2.
Not for both, just for NN1, since NN1's parameters are defined within opt1 and opt1.zero_grad() will affect NN1's parameters only. I guess here, in the first iteration, you can do NN2.zero_grad() to achieve this. In this first iteration, save out1 and out2 on self and use them again in the next iteration with optimizer_idx=1. Also, in this second iteration you need to zero_grad NN1 manually, else its leftover gradients will affect the next batch. For opt3 and opt4 this is simple I guess, no manual work. Why don't you draft the LightningModule once and paste it here, and people will try to help in a better way? Do you want the gradient calculated for NN2 in the first iteration to be updated with opt2.step()?
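A quick plain-PyTorch check of that behaviour (hypothetical tiny nets, not the actual code): opt1 only knows about net1's parameters, so opt1.zero_grad() leaves net2's grads untouched.

import torch

net1 = torch.nn.Linear(4, 1)
net2 = torch.nn.Linear(4, 1)
opt1 = torch.optim.SGD(net1.parameters(), lr=0.1)

x = torch.randn(8, 4)
loss = (net1(x) + net2(x)).sum()
loss.backward()  # populates grads for both net1 and net2

opt1.zero_grad()         # clears net1's grads only (zeroed or set to None, depending on the PyTorch version)
print(net2.weight.grad)  # still holds the gradient from loss.backward()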
@rohitgr7 Sorry, I haven't yet started modifying the code to the Lightning format; it's a very large code base and I'm still in the evaluation phase. Anyway, to your other question: I want both NN1 and NN2 to get updated with every iteration.
Modified equation:
How can the following solution be achieved using Lightning?
L1 = f1(out1, out2.detach())
L2 = g1(out1.detach(), out2)
L1.backward()
opt1.step()
opt1.zero_grad()
L2.backward()
opt2.step()
opt2.zero_grad()
Try:
def training_step(self, batch, batch_idx, optimizer_idx):
    if optimizer_idx == 0:
        self.out1 = self.net1(batch)
        self.out2 = self.net2(batch)
        loss1 = f1(self.out1, self.out2.detach())
        return {'loss': loss1}
    elif optimizer_idx == 1:
        loss2 = g1(self.out1.detach(), self.out2)
        return {'loss': loss2}

def configure_optimizers(self):
    # optimizer() stands for whichever optimizer class you use
    opt1 = optimizer(self.net1.parameters())
    opt2 = optimizer(self.net2.parameters())
    return [opt1, opt2]
What about a case with a single model but multiple weighted losses?
I have this in my model:
def training_step(self, batch, batch_idx):
    img_lr = batch['lr']
    img_hr = batch['hr']
    img_sr = self.forward(img_lr)

    losses = []
    losses_names = []
    for l in self._losses:
        loss = l.loss(img_sr, img_hr)
        effective_loss = l.weight * loss
        if len(self._losses) > 1:
            losses_names.append(l.name)
        else:
            losses_names.append('loss')
        losses.append(effective_loss)

    loss_sum = sum(losses)
    losses_dict = {n: l for n, l in zip(losses_names, losses)}
    if len(losses_names) > 1:
        losses_dict['loss'] = loss_sum

    self.log_dict(losses_dict, prog_bar=True, logger=False)
    return losses_dict
But then it gives this error:
Traceback (most recent call last):
File "train.py", line 339, in <module>
main(Model, args)
File "train.py", line 293, in main
trainer.fit(model, dataset)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
results = self.accelerator_backend.train()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 54, in train
results = self.train_or_test()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
results = self.trainer.train()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 483, in train
self.train_loop.run_training_epoch()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 550, in run_training_epoch
self.on_train_batch_end(epoch_output, epoch_end_outputs, batch, batch_idx, dataloader_idx)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 249, in on_train_batch_end
self.trainer.call_hook('on_train_batch_end', epoch_end_outputs, batch, batch_idx, dataloader_idx)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 823, in call_hook
trainer_hook(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/callback_hook.py", line 147, in on_train_batch_end
callback.on_train_batch_end(self, self.get_model(), outputs, batch, batch_idx, dataloader_idx)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/callbacks/progress.py", line 339, in on_train_batch_end
self.main_progress_bar.set_postfix(trainer.progress_bar_dict)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/properties.py", line 155, in progress_bar_dict
return dict(**ref_model.get_progress_bar_dict(), **self.logger_connector.progress_bar_metrics)
TypeError: type object got multiple values for keyword argument 'loss'
Exception ignored in: <function tqdm.__del__ at 0x7f42130cab00>
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1122, in __del__
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1335, in close
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1514, in display
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1125, in __repr__
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1475, in format_dict
TypeError: cannot unpack non-iterable NoneType object
Should I manually disable showing the loss in tqdm, then re-add it manually?
@george-gca by default, the 'loss' from the losses_dict returned by training_step is added to the progress bar automatically. Since you are logging it explicitly in your code, it creates a conflict here. Either you can disable it from your side, or pop the 'loss' that was added by default by overriding get_progress_bar_dict.
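For reference, overriding that hook could look roughly like this (a sketch against the 1.0-era API, which assumes Lightning adds the running loss under the 'loss' key, as the traceback above suggests):

def get_progress_bar_dict(self):
    # start from the default progress-bar entries and drop the automatic 'loss'
    # so it does not clash with the 'loss' we log ourselves
    items = super().get_progress_bar_dict()
    items.pop('loss', None)
    return items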
@rohitgr7 so I changed my code to this
def training_step(self, batch, batch_idx):
    img_lr = batch['lr']
    img_hr = batch['hr']
    img_sr = self.forward(img_lr)

    losses = []
    losses_names = []
    for l in self._losses:
        loss = l.loss(img_sr, img_hr)
        effective_loss = l.weight * loss
        losses_names.append(l.name)
        losses.append(effective_loss)

    losses_dict = {n: l for n, l in zip(losses_names, losses)}
    self.log_dict(losses_dict, prog_bar=True, logger=False)
    return losses_dict
and my resulting losses_dict looks like this: {'adaptive': tensor(1.2606, device='cuda:0', grad_fn=<MulBackward0>), 'l1': tensor(0.4253, device='cuda:0', grad_fn=<MulBackward0>)}. But it throws another error:
Traceback (most recent call last):
File "train.py", line 345, in <module>
main(Model, args)
File "train.py", line 299, in main
trainer.fit(model, dataset)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
results = self.accelerator_backend.train()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 54, in train
results = self.train_or_test()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
results = self.trainer.train()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 483, in train
self.train_loop.run_training_epoch()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 541, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 678, in run_training_batch
self.trainer.hiddens
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 760, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 328, in training_step
closure_loss = closure_loss / self.trainer.accumulate_grad_batches
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
Your training_step should either return nothing (None), a loss tensor, or a dict containing a 'loss' key.
Either you can disable it
By this, I meant setting prog_bar=False while logging the loss, or just not logging it at all, since it will be logged automatically.
@rohitgr7 so how should I do it if I want to show all losses in the progress bar, both the individual ones and the total loss? Do I have to implement my own progress bar, or is there any way of doing this only with self.log_dict?
Edit: never mind, I was blind from staring at my code for too long. After a break I went back and found what I had to do. Thanks.
Sure @pgmikhael, the main thing was the hint that @rohitgr7 gave me. I basically create the dictionary without the loss key, log this dictionary to the progress bar, then add the loss key and return it.
def training_step(self, batch, batch_idx):
    img_lr = batch['lr']
    img_hr = batch['hr']
    img_sr = self.forward(img_lr)

    losses = []
    losses_names = []
    for l in self._losses:
        loss = l.loss(img_sr, img_hr)
        effective_loss = l.weight * loss
        losses_names.append(l.name)
        losses.append(effective_loss)

    losses_dict = {n: l for n, l in zip(losses_names, losses)}
    if len(losses_names) > 1:
        # add other losses to progress bar since total loss
        # is added automatically
        self.log_dict(losses_dict, prog_bar=True, logger=False)

    # do this for tensorboard logging
    losses_dict = {f'loss/{k}': v for k, v in losses_dict.items()}

    # training_step must always return None, a Tensor, or a dict with at least
    # one key being 'loss'
    losses_dict['loss'] = sum(losses)
    return losses_dict
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!