Hi, recently I have been trying to standardize one of our research models, which led me to Lightning.
I have a situation involving multi-model, multi-loss training, which is described in the post below:
https://discuss.pytorch.org/t/multiple-networks-multiple-losses/91130?u=pavanmv
Please let me know if this can be achieved using Lightning.
Hi! Thanks for your contribution! Great first issue!
@MVPavan I don't see any reason why this should not work in Lightning. Lightning allows you to define multiple optimizers; all you need to do is pass the correct model parameters to the correct optimizer when defining them. You can refer to this documentation: https://pytorch-lightning.readthedocs.io/en/latest/api/pytorch_lightning.core.html#pytorch_lightning.core.LightningModule.configure_optimizers
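For illustration, a minimal sketch of what that could look like (the submodule names net1/net2 and the Adam optimizers are assumptions, not from the original post):

import torch
import pytorch_lightning as pl

class TwoNetModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # two independent networks, stand-ins for the actual models
        self.net1 = torch.nn.Linear(4, 1)
        self.net2 = torch.nn.Linear(4, 1)

    def configure_optimizers(self):
        # one optimizer per network, each given only that network's parameters
        opt1 = torch.optim.Adam(self.net1.parameters(), lr=1e-3)
        opt2 = torch.optim.Adam(self.net2.parameters(), lr=1e-3)
        return [opt1, opt2]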
@ananyahjha93 but how will loss.backward() and optimizer.step() work here? In the diagram given in the link, each loss's backward is done separately. Will it be equivalent to adding all the losses and doing backward once for the total loss?
@ananyahjha93 in my case loss2.backward() and opt2.step() have to be called only after calling loss1.backward() and opt1.step(), which updates all the grads of NN1 and flushes all the grads of NN2. Correct me if there is any gap in my understanding.
Okay, I misunderstood the gradient flow here; now I get it. @MVPavan training_step is called n times per batch, where n is the number of optimizers. So with multiple optimizers, training_step receives an optimizer_idx argument. All you need to do is set up some if/else conditions in your training_step, one branch per optimizer. Also, you need to store the outputs from the previous call (optimizer_idx=0): for loss1 you used out1 and out2, so store them and reuse them in the next call to training_step with optimizer_idx=1. After each call to training_step, loss.backward() and optimizer.step() are run for the optimizer at that optimizer_idx.
@rohitgr7 thanks for this answer!
@MVPavan
Also, since you are storing something like self.loss1 (like in the answer above) in the first call to training_step and then expecting it to be available in the next call, use DDP for multiple GPUs instead of DP. DP doesn't support state maintenance as of now.
since you are storing something like self.loss1 (like in the answer above)
you need to store the outputs from the previous step
Not the loss but outputs :)
Also, I am thinking about the inefficiency of loss1.backward() for NN2, since NN2 is not updated by optimizer1 with the gradients computed by loss1.backward(). I was thinking of a way to freeze the NN2 weights for optimizer_idx=0, calculate out2 and store it, then for optimizer_idx=1 unfreeze NN2, use the stored out2 to calculate loss2, and do loss2.backward(). Will the gradients for NN2 be created here, given that out2 was calculated while NN2 was frozen?
@rohitgr7 yeah, sorry about that. Thanks for the correction! What about
with torch.no_grad()
But still, the inefficiency remains. The efficient way, I think, would be to calculate out1 and out2 only once, storing them at optimizer_idx=0 in training_step, but have loss1.backward() not compute gradients for NN2 and loss2.backward() not compute gradients for NN1. ~Can't figure out a way to make this work.~
Edit: to avoid the gradient calculation, just use .detach(); but then why even have that term there, since the losses are added anyway.
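A small plain-PyTorch illustration of what .detach() does here (tiny hypothetical nets, not the actual models from the thread): detaching out2 cuts the graph on that branch, so loss1.backward() populates gradients for net1 only. Similarly, an output computed under torch.no_grad() carries no graph at all, so a later backward through it cannot reach that network.

import torch

net1 = torch.nn.Linear(4, 1)
net2 = torch.nn.Linear(4, 1)
x = torch.randn(8, 4)

out1 = net1(x)
out2 = net2(x)

# out2 is detached, so only net1's branch of the graph is traversed by backward
loss1 = ((out1 - out2.detach()) ** 2).mean()
loss1.backward()

print(net1.weight.grad is None)  # False: net1 received gradients
print(net2.weight.grad is None)  # True: net2 did not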
@rohitgr7 thanks for the explanation. In PyTorch, once we have all the outputs for one iteration, calling loss1.backward() calculates the grads for both NN1 and NN2; opt1.step() will then update only the weights of NN1, as it is defined for just NN1, and then zeros all the grads of NN1 and NN2.
Now loss2.backward() and opt2.step() will do the same for NN2. This way we can achieve both in a single training step. Correct me if I am missing something.
Now I wanted to understand whether this can be achieved using Lightning.
Thanks in advance!
then zeros all the grads of NN1 and NN2.
Not for both, just for NN1, since NN1's parameters are defined within opt1 and opt1.zero_grad() will affect NN1's parameters only. I guess here, in the first iteration, you can do NN2.zero_grad() to achieve this. In this first iteration, save out1 and out2 on self and use them again in the next iteration with optimizer_idx=1. Also, in this second iteration you need to zero_grad NN1 manually, else its leftover gradients will affect the next batch. For opt3 and opt4 this is simple I guess, no manual work. Why don't you draft the LightningModule once and paste it here, and people will try to help in a better way? Do you want the gradient calculated for NN2 in the first iteration to be updated with opt2.step()?
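A quick plain-PyTorch check of that behaviour (hypothetical tiny nets, not the actual code): opt1 only knows about net1's parameters, so opt1.zero_grad() leaves net2's grads untouched.

import torch

net1 = torch.nn.Linear(4, 1)
net2 = torch.nn.Linear(4, 1)
opt1 = torch.optim.SGD(net1.parameters(), lr=0.1)

x = torch.randn(8, 4)
loss = (net1(x) + net2(x)).sum()
loss.backward()  # populates grads for both net1 and net2

opt1.zero_grad()         # clears net1's grads only (zeroed or set to None, depending on the PyTorch version)
print(net2.weight.grad)  # still holds the gradient from loss.backward()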
@rohitgr7 Sorry, I haven't yet started modifying the code to the Lightning format; it's a very large code base and I'm still in the evaluation phase. Anyway, to your other question: I want both NN1 and NN2 to get updated with every iteration.
Modified equation:
How can the following solution be achieved using Lightning?
L1 = f1(out1, out2.detach())
L2 = g1(out1.detach(), out2)
L1.backward()
opt1.step()
opt1.zero_grad()
L2.backward()
opt2.step()
opt2.zero_grad()
Try:
def training_step(self, batch, batch_idx, optimizer_idx):
    if optimizer_idx == 0:
        self.out1 = self.net1(batch)
        self.out2 = self.net2(batch)
        loss1 = f1(self.out1, self.out2.detach())
        return {'loss': loss1}
    elif optimizer_idx == 1:
        loss2 = g1(self.out1.detach(), self.out2)
        return {'loss': loss2}

def configure_optimizers(self):
    # optimizer() stands for whichever optimizer class you use
    opt1 = optimizer(self.net1.parameters())
    opt2 = optimizer(self.net2.parameters())
    return [opt1, opt2]
What about a case with a single model but multiple weighted losses?
I have this in my model:
def training_step(self, batch, batch_idx):
    img_lr = batch['lr']
    img_hr = batch['hr']
    img_sr = self.forward(img_lr)

    losses = []
    losses_names = []
    for l in self._losses:
        loss = l.loss(img_sr, img_hr)
        effective_loss = l.weight * loss
        if len(self._losses) > 1:
            losses_names.append(l.name)
        else:
            losses_names.append('loss')
        losses.append(effective_loss)

    loss_sum = sum(losses)
    losses_dict = {n: l for n, l in zip(losses_names, losses)}
    if len(losses_names) > 1:
        losses_dict['loss'] = loss_sum

    self.log_dict(losses_dict, prog_bar=True, logger=False)
    return losses_dict
But then it gives this error:
Traceback (most recent call last):
File "train.py", line 339, in <module>
main(Model, args)
File "train.py", line 293, in main
trainer.fit(model, dataset)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
results = self.accelerator_backend.train()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 54, in train
results = self.train_or_test()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
results = self.trainer.train()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 483, in train
self.train_loop.run_training_epoch()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 550, in run_training_epoch
self.on_train_batch_end(epoch_output, epoch_end_outputs, batch, batch_idx, dataloader_idx)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 249, in on_train_batch_end
self.trainer.call_hook('on_train_batch_end', epoch_end_outputs, batch, batch_idx, dataloader_idx)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 823, in call_hook
trainer_hook(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/callback_hook.py", line 147, in on_train_batch_end
callback.on_train_batch_end(self, self.get_model(), outputs, batch, batch_idx, dataloader_idx)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/callbacks/progress.py", line 339, in on_train_batch_end
self.main_progress_bar.set_postfix(trainer.progress_bar_dict)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/properties.py", line 155, in progress_bar_dict
return dict(**ref_model.get_progress_bar_dict(), **self.logger_connector.progress_bar_metrics)
TypeError: type object got multiple values for keyword argument 'loss'
Exception ignored in: <function tqdm.__del__ at 0x7f42130cab00>
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1122, in __del__
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1335, in close
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1514, in display
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1125, in __repr__
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1475, in format_dict
TypeError: cannot unpack non-iterable NoneType object
Should I manually disable showing the loss in tqdm, then re-add it manually?
@george-gca by default, the 'loss' from the losses_dict returned by training_step is added to the progress bar automatically. Since you are logging it explicitly in your code, it creates a conflict here. Either you can disable it from your side, or pop the 'loss' that was added by default by overriding get_progress_bar_dict.
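For reference, overriding that hook could look roughly like this (a sketch against the 1.0-era API, which assumes Lightning adds the running loss under the 'loss' key, as the traceback above suggests):

def get_progress_bar_dict(self):
    # start from the default progress-bar entries and drop the automatic 'loss'
    # so it does not clash with the 'loss' we log ourselves
    items = super().get_progress_bar_dict()
    items.pop('loss', None)
    return items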
@rohitgr7 so I changed my code to this
def training_step(self, batch, batch_idx):
    img_lr = batch['lr']
    img_hr = batch['hr']
    img_sr = self.forward(img_lr)

    losses = []
    losses_names = []
    for l in self._losses:
        loss = l.loss(img_sr, img_hr)
        effective_loss = l.weight * loss
        losses_names.append(l.name)
        losses.append(effective_loss)

    losses_dict = {n: l for n, l in zip(losses_names, losses)}
    self.log_dict(losses_dict, prog_bar=True, logger=False)
    return losses_dict
and my resulting losses_dict looks like this: {'adaptive': tensor(1.2606, device='cuda:0', grad_fn=<MulBackward0>), 'l1': tensor(0.4253, device='cuda:0', grad_fn=<MulBackward0>)}. But it throws another error:
Traceback (most recent call last):
File "train.py", line 345, in <module>
main(Model, args)
File "train.py", line 299, in main
trainer.fit(model, dataset)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
results = self.accelerator_backend.train()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 54, in train
results = self.train_or_test()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
results = self.trainer.train()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 483, in train
self.train_loop.run_training_epoch()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 541, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 678, in run_training_batch
self.trainer.hiddens
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 760, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 328, in training_step
closure_loss = closure_loss / self.trainer.accumulate_grad_batches
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
Your training_step should either return nothing (None), a loss tensor, or a dict containing a 'loss' key.
Either you can disable it
By this, I meant setting prog_bar=False while logging the loss, or just not logging it at all, since it will be logged automatically.
@rohitgr7 so how should I do it if I want to show all losses in the progress bar, both the individual ones and the total loss? Do I have to implement my own progress bar, or is there any way of doing this only with self.log_dict?
Edit: never mind, I was blind from staring at my code for too long. After a break I went back and found what I had to do. Thanks.
Sure @pgmikhael, the main thing was the hint that @rohitgr7 gave me. I basically create the dictionary without the loss key, log this dictionary to the progress bar, then add the loss key and return it.
def training_step(self, batch, batch_idx):
    img_lr = batch['lr']
    img_hr = batch['hr']
    img_sr = self.forward(img_lr)

    losses = []
    losses_names = []
    for l in self._losses:
        loss = l.loss(img_sr, img_hr)
        effective_loss = l.weight * loss
        losses_names.append(l.name)
        losses.append(effective_loss)

    losses_dict = {n: l for n, l in zip(losses_names, losses)}
    if len(losses_names) > 1:
        # add other losses to progress bar since total loss
        # is added automatically
        self.log_dict(losses_dict, prog_bar=True, logger=False)

    # do this for tensorboard logging
    losses_dict = {f'loss/{k}': v for k, v in losses_dict.items()}

    # training_step must always return None, a Tensor, or a dict with at least
    # one key being 'loss'
    losses_dict['loss'] = sum(losses)
    return losses_dict
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!