The docs state that "If you'd like to do something special with the outputs other than logging, implement __epoch_end." and give the following example:
def training_step(self, batch, batch_idx):
    result = pl.TrainResult(loss)
    result.some_prediction = some_prediction
    return result

def training_epoch_end(self, training_step_output_result):
    all_train_predictions = training_step_output_result.some_prediction
    training_step_output_result.some_new_prediction = some_new_prediction
    return training_step_output_result
When the custom value is a Tensor, this usage fails in the Result.padded_gather() method: all entries in the result are traversed and their tbptt_pad_token meta property is looked up, but the custom attribute has no meta entry, so a KeyError is raised for that key (here named probs):
File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1073, in fit
results = self.accelerator_backend.train(model)
File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\accelerators\gpu_backend.py", line 51, in train
results = self.trainer.run_pretrain_routine(model)
File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1239, in run_pretrain_routine
self.train()
File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 394, in train
self.run_training_epoch()
File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 550, in run_training_epoch
self.run_training_epoch_end(epoch_output, checkpoint_accumulator, early_stopping_accumulator, num_optimizers)
File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 657, in run_training_epoch_end
epoch_output = self.__gather_result_across_time_and_optimizers(epoch_output)
File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 733, in __gather_result_across_time_and_optimizers
gathered_opt_output = time_gathered_outputs[0].__class__.padded_gather(time_gathered_outputs)
File "C:\Users\yigit\Anaconda3\lib\site-packages\pytorch_lightning\core\step_result.py", line 338, in padded_gather
padding_key = meta[name]['tbptt_pad_token']
KeyError: 'probs'
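To make the failure mode concrete, here is a minimal, self-contained sketch (not the actual pytorch_lightning code; the dictionary layout and key names are assumptions based on the traceback): keys registered through log() get a meta entry carrying tbptt_pad_token, while a plain attribute assignment does not, so the lookup blows up on the custom key.

import torch

# Toy stand-in for a Result object (NOT pytorch_lightning internals).
result = {
    'minimize': torch.tensor(0.5),           # the training loss
    'probs': torch.rand(8, 2),               # custom attribute, set directly
    'meta': {
        'minimize': {'tbptt_pad_token': 0},  # meta exists only for managed/logged keys
    },
}

meta = result['meta']
for name in result:
    if name == 'meta':
        continue
    padding_key = meta[name]['tbptt_pad_token']  # raises KeyError: 'probs'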
Steps to reproduce the behavior:
Try the train/val loop example with a custom attribute, as shown in the Result objects documentation.
I don't know what the padding in Result does, but I suspect it should only be applied to logged results.
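If that is the case, one possible workaround (unverified; a sketch based on the documented log() API rather than on anything tested here) would be to register the value through log() so that it receives a meta entry, instead of setting it as a raw attribute:

def training_step(self, batch, batch_idx):
    result = pl.TrainResult(loss)
    # log() registers reduction/padding metadata for the key, so
    # padded_gather() should find a 'tbptt_pad_token' entry for it.
    # The on_step/on_epoch/logger flags here are illustrative.
    result.log('some_prediction', some_prediction,
               on_step=False, on_epoch=True, logger=False)
    return result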
Hi! Thanks for your contribution, great first issue!
@williamFalcon @nateraw mind having a look?
The problem seems to only occur when the attribute some_prediction is of type torch.Tensor.
The following worked for me:
def training_step(self, batch, batch_idx):
    result = pl.TrainResult(loss)
    # Converting to a NumPy array means the value is no longer a torch.Tensor,
    # which sidesteps the padded_gather() meta lookup.
    result.some_prediction = some_prediction.detach().cpu().numpy()
    return result
I see a related oddity: I save a tensor of shape (8, 8, 2) to my result object as an attribute in each val_step. With distributed_backend='dp' and a single GPU, the value in val_epoch_end has shape (72, 8) after 9 batches, and its values don't seem to correlate at all with what I saved in each val_step. When I disable distributed_backend='dp', things work as expected and I get a list of 9 tensors, each of shape (8, 8, 2).
Similar to the logic of the log method, I suggest adding a designated method for tracking a new metric (storing it only, without logging), which would invoke the __set_meta method and then set the value.
We might even consider supporting advanced features such as reduce_fx, so that users who override training_epoch_end also receive the results reduced over the epoch.
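As a rough sketch of what such a method could look like (purely hypothetical; the name track and the exact __set_meta signature are assumptions, not pytorch_lightning API):

import torch

# Hypothetical helper intended to live on Result; argument names and the
# __set_meta signature are assumed, not taken from pytorch_lightning.
def track(self, name, value, reduce_fx=torch.mean,
          tbptt_reduce_fx=torch.mean, tbptt_pad_token=0):
    # Register meta (padding/reduction info) just like log() would,
    # but never forward the value to the logger or the progress bar.
    self.__set_meta(
        name, value,
        prog_bar=False, logger=False,
        on_step=False, on_epoch=True,
        reduce_fx=reduce_fx,
        tbptt_reduce_fx=tbptt_reduce_fx,
        tbptt_pad_token=tbptt_pad_token,
    )
    # Store the value itself so it is gathered across steps/optimizers.
    self[name] = value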
Let me know if you like the idea, and I'll be happy to create a PR :)
Closing this since EvalResult and TrainResult were removed in v1.0.