`model.test()` can fail when training via `dp` because `TrainerEvaluationLoopMixin.evaluation_forward` doesn't handle an edge case.
Attempt to `model.test()` any Lightning model in `dp` mode (I believe it fails in any of the distributed modes handled at https://github.com/PyTorchLightning/pytorch-lightning/blob/3a642601e84c3abf1f1b438f9acc932a1f150f7f/pytorch_lightning/trainer/evaluation_loop.py#L420).
_Note that the validation and training steps work well, but test fails._
The bottom of the stack trace isn't especially illuminating, but the crux of the matter is captured here:
```python
    411 def evaluation_forward(self, model, batch, batch_idx, dataloader_idx, test_mode: bool = False):
    412     # make dataloader_idx arg in validation_step optional
    413     args = [batch, batch_idx]
    414
    415     if (test_mode and len(self.test_dataloaders) > 1) \
    416             or (not test_mode and len(self.val_dataloaders) > 1):
    417         args.append(dataloader_idx)
    418
    419     # handle DP, DDP forward
    420     if self.use_ddp or self.use_dp or self.use_ddp2:
--> 421         output = model(*args)
    422         return output
```
At line 421, the code that _will_ run (for my model) is effectively `output = model(*args[0][:-1])`, but other things fail downstream of that hack. Note that `args[0]` is the tuple of tensors, and the last tensor is the target.
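To make the indexing concrete, here is a toy illustration (pure-Python stand-ins for the tensors; all names are hypothetical) of what `model(*args)` receives versus what a multi-input `forward` actually needs:

```python
# Hypothetical stand-ins for the contents of a batch: two inputs and a target.
batch = ("x1", "x2", "y")
batch_idx = 0
args = [batch, batch_idx]

# What evaluation_forward passes to the wrapped model:
#   model(*args)          ==  model(batch, batch_idx)
# What a forward(x1, x2) signature actually needs:
#   model(*args[0][:-1])  ==  model("x1", "x2")
inputs = args[0][:-1]
print(inputs)  # ('x1', 'x2') -- the inputs only, with the target dropped
```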
TL;DR: at this point (for test -- again, val and train work perfectly) I believe what we want is something like `output = model.test_step(*args)` instead (see later on in `evaluation_forward`, below the trace above).
However, I realized that the model, now a `LightningDataParallel` instance, no longer exposes the `test_step` defined on the original `LightningModule`, so my understanding of how multi-GPU support is wired up is the limiting factor here.
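The attribute loss can be illustrated without any GPU at all. A minimal sketch (this is *not* the real `LightningDataParallel`, just a stand-in wrapper that only forwards calls):

```python
class DataParallelLike:
    """Stand-in for a DataParallel-style wrapper: it exposes only __call__
    (delegating to the wrapped module's forward), so custom methods such as
    test_step defined on the original module are not reachable through it."""

    def __init__(self, module):
        self._module = module

    def __call__(self, *args):
        return self._module.forward(*args)


class MyLightningModuleLike:
    def forward(self, x):
        return x * 2

    def test_step(self, x, batch_idx):
        return {"out": self.forward(x)}


wrapped = DataParallelLike(MyLightningModuleLike())
print(wrapped(3))                     # forward still works through the wrapper: 6
print(hasattr(wrapped, "test_step"))  # False -- the custom method is hidden
```

In real PyTorch, `DataParallel` does keep the original module available as `.module`, which is one way code can route around this.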
This mock, I thought, would resolve the issue for me, but I then realized that the `test_step` method no longer existed, per the above paragraph:
```python
from unittest import mock

import pytorch_lightning as pl

ORIG = pl.trainer.evaluation_loop.TrainerEvaluationLoopMixin.evaluation_forward


def _mock_evaluation_forward(self, model, batch, batch_idx, dataloader_idx, test_mode: bool = False):
    # Defer to the original implementation except in the broken case:
    # test mode under dp/ddp/ddp2.
    if not test_mode or not (self.use_ddp or self.use_dp or self.use_ddp2):
        return ORIG(self, model, batch, batch_idx, dataloader_idx, test_mode)
    # Rebuild args the same way evaluation_forward does.
    args = [batch, batch_idx]
    if len(self.test_dataloaders) > 1:
        args.append(dataloader_idx)
    return model.test_step(*args)


@mock.patch(
    'pytorch_lightning.trainer.evaluation_loop.TrainerEvaluationLoopMixin.evaluation_forward',
    _mock_evaluation_forward,
)
def train_my_model(): ...
```
Thanks for the great library! I can't pin down precisely why train and val work but test fails. One thing to note: my model's `forward` takes several tensors, not just one, which may be a factor. Everything works perfectly with `dp` turned off.
Hi! Thanks for your contribution, great first issue!
Hmm, good catch. Want to submit a PR? We'll help you finish it.
@williamFalcon I'd like to take this up if no one else is working on it. Please tell me if it's free, to avoid duplication of effort.
@gingerboredman I've not looked into it any further yet, so I think you're likely in the clear if you want to give it a shot!
@mstewart141 yeah cool, I'm on it. Thanks 😁
For some reason, one of my GPUs stopped working. Is there any way to work around this? I can't test any changes I make with ddp enabled, and my new GPU will take a week or so to arrive. If not, I'd advise reassigning this task, as it would cause further delay.
@williamFalcon
@mstewart141
@williamFalcon @mstewart141 Could I get some help regarding my previous message?
Fixed! 0.8.5