Pytorch-lightning: Can't set result object predictions to tensors

Created on 25 Aug 2020 · 6 comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

The docs state that "If you'd like to do something special with the outputs other than logging, implement __epoch_end." and give the following example:

def training_step(self, batch, batch_idx):
    result = pl.TrainResult(loss)
    result.some_prediction = some_prediction
    return result

def training_epoch_end(self, training_step_output_result):
    all_train_predictions = training_step_output_result.some_prediction

    training_step_output_result.some_new_prediction = some_new_prediction
    return training_step_output_result

When the custom value is a Tensor, this usage fails in the Result.padded_gather() method, where all entries in the result are traversed and their tbptt_pad_token meta property is looked up; since the custom property has no meta entry, the lookup raises a KeyError:

  File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1073, in fit
    results = self.accelerator_backend.train(model)
  File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\accelerators\gpu_backend.py", line 51, in train
    results = self.trainer.run_pretrain_routine(model)
  File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1239, in run_pretrain_routine
    self.train()
  File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 394, in train
    self.run_training_epoch()
  File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 550, in run_training_epoch
    self.run_training_epoch_end(epoch_output, checkpoint_accumulator, early_stopping_accumulator, num_optimizers)
  File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 657, in run_training_epoch_end
    epoch_output = self.__gather_result_across_time_and_optimizers(epoch_output)
  File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 733, in __gather_result_across_time_and_optimizers
    gathered_opt_output = time_gathered_outputs[0].__class__.padded_gather(time_gathered_outputs)
  File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\core\step_result.py", line 338, in padded_gather
    padding_key = meta[name]['tbptt_pad_token']
KeyError: 'probs'
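
For illustration, the failing lookup can be sketched with plain dictionaries standing in for the Result object (a simplification of the behaviour described above, not the actual pytorch-lightning source):

result = {
    'minimize': 0.5,                               # the loss passed to pl.TrainResult(loss)
    'probs': [0.2, 0.8],                           # custom attribute set via plain assignment
    'meta': {'minimize': {'tbptt_pad_token': 0}},  # only keys registered through log() get meta
}

meta = result['meta']
for name in (k for k in result if k != 'meta'):
    padding_key = meta[name]['tbptt_pad_token']    # KeyError: 'probs'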

To Reproduce

Steps to reproduce the behavior:

Try the train/val loop example with a custom attribute from the Result objects documentation.

Code sample
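
A minimal reproduction sketch, assuming a toy classifier (the module, data, and the probs attribute name are illustrative, not taken from the report):

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class ReproModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.layer(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        result = pl.TrainResult(minimize=loss)
        # attaching a raw tensor is what triggers the KeyError in padded_gather()
        result.probs = torch.softmax(logits, dim=-1)
        return result

    def training_epoch_end(self, training_step_output_result):
        all_probs = training_step_output_result.probs  # never reached; gathering fails first
        return training_step_output_result

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

dataset = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
trainer = pl.Trainer(max_epochs=1)
trainer.fit(ReproModule(), DataLoader(dataset, batch_size=8))

Running this with pytorch-lightning 0.9.0 should hit the KeyError shown above at the end of the first epoch.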

Expected behavior


I don't know what padding in Result does, but I guess it must be applied to logged results only.

Environment

  • Packages:
    - numpy: 1.18.1
    - pyTorch_debug: False
    - pyTorch_version: 1.6.0
    - pytorch-lightning: 0.9.0
    - tensorboard: 2.2.0
    - tqdm: 4.47.0
  • System:
    - OS: Windows
    - architecture: 64bit, WindowsPE
    - processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
    - python: 3.7.1
    - version: 10.0.19041

Additional context

Labels: API / design, ResultObj, bug / fix, duplicate, help wanted


All 6 comments

Hi! Thanks for your contribution, great first issue!

@williamFalcon @nateraw mind having a look?

The problem seems to only occur when the attribute some_prediction is of type torch.Tensor.
The following worked for me:

def training_step(self, batch, batch_idx):
    result = pl.TrainResult(loss)
    result.some_prediction = some_prediction.detach().cpu().numpy()
    return result

I get some weird issues: I save a tensor of shape (8, 8, 2) to my result object as an attribute in each val_step, and with distributed_backend='dp' and a single GPU the value in val_epoch_end has shape (72, 8) after 9 batches; the values don't seem to correlate at all with those I save in each val_step. When I disable distributed_backend='dp', things work as expected and I get a list of 9 tensors, each of shape (8, 8, 2).

Similar to the logic of the log method, I suggest adding a designated method to track a new metric (storing it without logging it), which would invoke the __set_meta method and then set the value.

We might even consider allowing advanced features such as reduce_fx, so that users who override training_epoch_end also receive the results reduced over the epoch.

Let me know if you like the idea, and I'll be happy to create a PR :)
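
A rough sketch of this idea, expressed as a small TrainResult subclass that reuses the public log() call so the value gets a meta entry; the track method name is hypothetical, and the log() keyword arguments are assumed from the 0.9 API:

import torch
import pytorch_lightning as pl

class TrackingTrainResult(pl.TrainResult):
    def track(self, name, value, reduce_fx=torch.mean):
        # register the value like a logged metric (so gathering/padding knows about it),
        # but keep it out of the logger and the progress bar
        self.log(name, value, prog_bar=False, logger=False, reduce_fx=reduce_fx)

# usage mirroring the doc example above
def training_step(self, batch, batch_idx):
    result = TrackingTrainResult(loss)
    result.track('some_prediction', some_prediction)
    return result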

Closing this since EvalResult and TrainResult are removed in v1.0.
