`model.test()` can fail when training via `dp` because `TrainerEvaluationLoopMixin.evaluation_forward` doesn't handle an edge case.
Attempt to `model.test()` any Lightning model in `dp` mode (I believe it fails in any of the distributed modes handled at https://github.com/PyTorchLightning/pytorch-lightning/blob/3a642601e84c3abf1f1b438f9acc932a1f150f7f/pytorch_lightning/trainer/evaluation_loop.py#L420).
_Note that the validation and training steps work well, but test fails._
The bottom of the stack trace isn't especially illuminating, but the crux of the matter is captured here:
```python
    411 def evaluation_forward(self, model, batch, batch_idx, dataloader_idx, test_mode: bool = False):
    412     # make dataloader_idx arg in validation_step optional
    413     args = [batch, batch_idx]
    414
    415     if (test_mode and len(self.test_dataloaders) > 1) \
    416             or (not test_mode and len(self.val_dataloaders) > 1):
    417         args.append(dataloader_idx)
    418
    419     # handle DP, DDP forward
    420     if self.use_ddp or self.use_dp or self.use_ddp2:
--> 421         output = model(*args)
    422         return output
```
At line 421, the code that _will_ run (for my model) is effectively `output = model(*args[0][:-1])`, but other things fail downstream of that hack. Note that `args[0]` is the tuple of tensors, and the last tensor is the target.
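To make the indexing concrete, here is a toy illustration (pure-Python stand-ins for the tensors; all names are hypothetical) of what `model(*args)` receives versus what a multi-input `forward` actually needs:

```python
# Hypothetical stand-ins for the contents of a batch: two inputs and a target.
batch = ("x1", "x2", "y")
batch_idx = 0
args = [batch, batch_idx]

# What evaluation_forward passes to the wrapped model:
#   model(*args)          ==  model(batch, batch_idx)
# What a forward(x1, x2) signature actually needs:
#   model(*args[0][:-1])  ==  model("x1", "x2")
inputs = args[0][:-1]
print(inputs)  # ('x1', 'x2') -- the inputs only, with the target dropped
```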
TL;DR: at this point (for test -- again, val and train work perfectly) I believe what we want is something like `output = model.test_step(*args)` instead (see later on in `evaluation_forward`, below the trace above).
However, I realized that the model, now a `LightningDataParallel` instance, no longer exposes the `test_step` defined on the original `LightningModule`, so my understanding of how multi-GPU support is wired up is the limiting factor here.
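The attribute loss can be illustrated without any GPU at all. A minimal sketch (this is *not* the real `LightningDataParallel`, just a stand-in wrapper that only forwards calls):

```python
class DataParallelLike:
    """Stand-in for a DataParallel-style wrapper: it exposes only __call__
    (delegating to the wrapped module's forward), so custom methods such as
    test_step defined on the original module are not reachable through it."""

    def __init__(self, module):
        self._module = module

    def __call__(self, *args):
        return self._module.forward(*args)


class MyLightningModuleLike:
    def forward(self, x):
        return x * 2

    def test_step(self, x, batch_idx):
        return {"out": self.forward(x)}


wrapped = DataParallelLike(MyLightningModuleLike())
print(wrapped(3))                     # forward still works through the wrapper: 6
print(hasattr(wrapped, "test_step"))  # False -- the custom method is hidden
```

In real PyTorch, `DataParallel` does keep the original module available as `.module`, which is one way code can route around this.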
This mock, I thought, would resolve the issue for me, but I then realized that the `test_step` method no longer existed, per the above paragraph:
```python
from unittest import mock

import pytorch_lightning as pl

ORIG = pl.trainer.evaluation_loop.TrainerEvaluationLoopMixin.evaluation_forward


def _mock_evaluation_forward(self, model, batch, batch_idx, dataloader_idx, test_mode: bool = False):
    # Defer to the original implementation except in the broken case:
    # test mode under dp/ddp/ddp2.
    if not test_mode or not (self.use_ddp or self.use_dp or self.use_ddp2):
        return ORIG(self, model, batch, batch_idx, dataloader_idx, test_mode)
    # Rebuild args the same way evaluation_forward does.
    args = [batch, batch_idx]
    if len(self.test_dataloaders) > 1:
        args.append(dataloader_idx)
    return model.test_step(*args)


@mock.patch(
    'pytorch_lightning.trainer.evaluation_loop.TrainerEvaluationLoopMixin.evaluation_forward',
    _mock_evaluation_forward,
)
def train_my_model(): ...
```
Thanks for the great library! I can't pin down precisely why train and val work but test fails. One thing to note: my model's `forward` takes several tensors, not just one, which may be a factor. Everything works perfectly with `dp` turned off.
Hi! Thanks for your contribution, great first issue!
Hmm, good catch. Want to submit a PR? We'll help you finish it.
@williamFalcon I'd like to take this up if no one else is working on it. Please tell me if it's free, to avoid duplication of effort.
@gingerboredman I've not looked into it any further yet, so I think you're likely in the clear if you want to give it a shot!
@mstewart141 yeah cool, I'm on it. Thanks 😁
For some reason, one of my GPUs stopped working. Is there any way to work around this? I can't test any changes I make with ddp enabled, and my new GPU will take a week or so to arrive. If not, I'd advise reassigning this task, as it would cause further delay.
@williamFalcon
@mstewart141
@williamFalcon @mstewart141 Could I get some help regarding my previous message?
Fixed! 0.8.5