Pytorch-lightning: Can't set result object predictions to tensors

Created on 25 Aug 2020 · 6 comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

The docs state that "If you'd like to do something special with the outputs other than logging, implement __epoch_end." and give the following example:

def training_step(self, batch, batch_idx):
    result = pl.TrainResult(loss)
    result.some_prediction = some_prediction
    return result

def training_epoch_end(self, training_step_output_result):
    all_train_predictions = training_step_output_result.some_prediction

    training_step_output_result.some_new_prediction = some_new_prediction
    return training_step_output_result

When the custom value is a Tensor, this usage fails in the Result.padded_gather() method, where all entries in the result are traversed and their tbptt_pad_token meta property is looked up; since the custom property has no meta entry, the lookup raises a KeyError:

  File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1073, in fit
    results = self.accelerator_backend.train(model)
  File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\accelerators\gpu_backend.py", line 51, in train
    results = self.trainer.run_pretrain_routine(model)
  File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1239, in run_pretrain_routine
    self.train()
  File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 394, in train
    self.run_training_epoch()
  File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 550, in run_training_epoch
    self.run_training_epoch_end(epoch_output, checkpoint_accumulator, early_stopping_accumulator, num_optimizers)
  File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 657, in run_training_epoch_end
    epoch_output = self.__gather_result_across_time_and_optimizers(epoch_output)
  File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 733, in __gather_result_across_time_and_optimizers
    gathered_opt_output = time_gathered_outputs[0].__class__.padded_gather(time_gathered_outputs)
  File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\core\step_result.py", line 338, in padded_gather
    padding_key = meta[name]['tbptt_pad_token']
KeyError: 'probs'
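
For illustration, the failing lookup can be sketched with plain dictionaries standing in for the Result object (a simplification of the behaviour described above, not the actual pytorch-lightning source):

result = {
    'minimize': 0.5,                               # the loss passed to pl.TrainResult(loss)
    'probs': [0.2, 0.8],                           # custom attribute set via plain assignment
    'meta': {'minimize': {'tbptt_pad_token': 0}},  # only keys registered through log() get meta
}

meta = result['meta']
for name in (k for k in result if k != 'meta'):
    padding_key = meta[name]['tbptt_pad_token']    # KeyError: 'probs'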

To Reproduce

Steps to reproduce the behavior:

Try the train/val loop example with a custom attribute from the Result objects documentation.

Code sample
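
A minimal reproduction sketch, assuming a toy classifier (the module, data, and the probs attribute name are illustrative, not taken from the report):

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class ReproModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.layer(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        result = pl.TrainResult(minimize=loss)
        # attaching a raw tensor is what triggers the KeyError in padded_gather()
        result.probs = torch.softmax(logits, dim=-1)
        return result

    def training_epoch_end(self, training_step_output_result):
        all_probs = training_step_output_result.probs  # never reached; gathering fails first
        return training_step_output_result

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

dataset = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
trainer = pl.Trainer(max_epochs=1)
trainer.fit(ReproModule(), DataLoader(dataset, batch_size=8))

Running this with pytorch-lightning 0.9.0 should hit the KeyError shown above at the end of the first epoch.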

Expected behavior


I don't know what padding in Result does, but I guess it must be applied to logged results only.

Environment

  • Packages:
    - numpy: 1.18.1
    - pyTorch_debug: False
    - pyTorch_version: 1.6.0
    - pytorch-lightning: 0.9.0
    - tensorboard: 2.2.0
    - tqdm: 4.47.0
  • System:
    - OS: Windows
    - architecture: 64bit, WindowsPE
    - processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
    - python: 3.7.1
    - version: 10.0.19041

Additional context

Labels: API / design, ResultObj, bug / fix, duplicate, help wanted


All 6 comments

Hi! Thanks for your contribution, great first issue!

@williamFalcon @nateraw mind having a look?

The problem seems to only occur when the attribute some_prediction is of type torch.Tensor.
The following worked for me:

def training_step(self, batch, batch_idx):
    result = pl.TrainResult(loss)
    result.some_prediction = some_prediction.detach().cpu().numpy()
    return result

I get some weird issues: I save a tensor of shape (8, 8, 2) to my result object as an attribute in each val_step, and with distributed_backend='dp' and a single GPU the value in val_epoch_end has shape (72, 8) after 9 batches; the values don't seem to correlate at all with those I save in each val_step. When I disable distributed_backend='dp', things work as expected and I get a list of 9 tensors, each of shape (8, 8, 2).

Similar to the logic of the log method, I suggest adding a designated method to track a new metric (storing it without logging it), which would invoke the __set_meta method and then set the value.

We might even consider allowing advanced features such as reduce_fx, so that users who override training_epoch_end also receive the results reduced over the epoch.

Let me know if you like the idea, and I'll be happy to create a PR :)
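
A rough sketch of this idea, expressed as a small TrainResult subclass that reuses the public log() call so the value gets a meta entry; the track method name is hypothetical, and the log() keyword arguments are assumed from the 0.9 API:

import torch
import pytorch_lightning as pl

class TrackingTrainResult(pl.TrainResult):
    def track(self, name, value, reduce_fx=torch.mean):
        # register the value like a logged metric (so gathering/padding knows about it),
        # but keep it out of the logger and the progress bar
        self.log(name, value, prog_bar=False, logger=False, reduce_fx=reduce_fx)

# usage mirroring the doc example above
def training_step(self, batch, batch_idx):
    result = TrackingTrainResult(loss)
    result.track('some_prediction', some_prediction)
    return result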

Closing this since EvalResult and TrainResult are removed in v1.0.
