Pytorch-lightning: Missing attribute "training_step_output_for_epoch_end"

Created on 28 Sep 2020 · 12 comments · Source: PyTorchLightning/pytorch-lightning

I used the documentation way of stopping the training (https://pytorch-lightning.readthedocs.io/en/latest/early_stopping.html#enable-early-stopping-using-callbacks-on-epoch-end).

If the on_batch_start method returns -1 at the very beginning of an epoch, the AttributeError named in the title is raised.
The problem is in training_loop.py line 496 (batch_output.training_step_output_for_epoch_end).

Code sample

Use the method and run your code:

    def on_batch_start(self, batch):
        return -1

Expected behavior

Check whether batch_output equals -1 before running training_loop.py line 495.
Early stopping implemented the way the documentation specifies should not throw an exception; it should simply stop the training.
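The proposed check can be modeled with a minimal standalone sketch. All names here are illustrative stand-ins, not Lightning's actual internals: when a hook returns -1, the batch run short-circuits and yields the sentinel -1 instead of a result object, so the epoch loop must test for the sentinel before touching any attribute.

```python
# Toy model of the bug and the proposed guard (illustrative names only).

class BatchOutput:
    """Stand-in for the per-batch result object Lightning builds."""
    def __init__(self, value):
        self.training_step_output_for_epoch_end = value

def run_training_batch(stop: bool):
    # Mimics Lightning's behavior: if on_batch_start returns -1, the
    # whole batch run short-circuits and returns the sentinel -1.
    if stop:
        return -1
    return BatchOutput("step-output")

def run_training_epoch(stop: bool):
    epoch_output = []
    for _ in range(3):
        batch_output = run_training_batch(stop)
        if batch_output == -1:
            # The proposed guard: stop the epoch instead of accessing
            # an attribute that the int sentinel does not have.
            break
        epoch_output.append(batch_output.training_step_output_for_epoch_end)
    return epoch_output
```

Without the `batch_output == -1` guard, the final line would raise the reported `AttributeError` whenever the hook signals a stop.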

Environment

  • CUDA:

    • GPU:

    • available: False

    • version: None

  • Packages:

    • numpy: 1.19.1

    • pyTorch_debug: False

    • pyTorch_version: 1.6.0

    • pytorch-lightning: 0.9.0

    • tqdm: 4.49.0

  • System:

    • OS: Windows

    • architecture:



      • 64bit


      • WindowsPE



    • processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel

    • python: 3.8.5

    • version: 10.0.18362

bug / fix help wanted


All 12 comments

Hi! Thanks for your contribution, great first issue!

@chrismaliszewski can you confirm this now stops the training epoch?

Should I update via the conda command line? Nothing has changed:

Traceback (most recent call last):
  File "XXX\__main__train.py", line 54, in <module>
    trainer.fit(model)
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1084, in fit
    results = self.accelerator_backend.train(model)
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\accelerators\cpu_backend.py", line 39, in train
    results = self.trainer.run_pretrain_routine(model)
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1239, in run_pretrain_routine
    self.train()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 394, in train
    self.run_training_epoch()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 496, in run_training_epoch
    batch_output.training_step_output_for_epoch_end,
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\utilities\parsing.py", line 144, in __getattr__
    raise AttributeError(f'Missing attribute "{key}"')
AttributeError: Missing attribute "training_step_output_for_epoch_end"
Epoch 0:   0%|          | 0/4 [00:00<?, ?it/s]

Process finished with exit code 1

Or should I update directly from GitHub, i.e. using the method provided here: https://stackoverflow.com/questions/19042389/conda-installing-upgrading-directly-from-github?

Yes please. It's fixed on master but hasn't been released yet. You can do the following in your conda or pip env:

pip install git+https://github.com/PyTorchLightning/pytorch-lightning.git@master --upgrade

After the update you suggested I have MAJOR problems, including crashing errors.

Regarding the issue I posted, I get the following error no matter whether I return -1 or anything else:

File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 442, in fit
    results = self.accelerator_backend.train()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\accelerators\cpu_backend.py", line 47, in train
    results = self.train_or_test()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\accelerators\base_backend.py", line 43, in train_or_test
    results = self.trainer.train()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 489, in train
    self.train_loop.run_training_epoch()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 516, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 617, in run_training_batch
    response = self.trainer.call_hook('on_batch_start')
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 807, in call_hook
    output = hook_fx(*args, **kwargs)
TypeError: on_batch_start() missing 1 required positional argument: 'batch'
Epoch 0:   0%|          | 0/4 [00:00<?, ?it/s]

I haven't changed anything in the definition of my function and it looks as follows:

    def on_batch_start(self, batch):
        return -1  # unconditional return added for testing; the code below is unreachable
        if self.get_early_stop(self.hparams['early_stop_path']):
            return -1
        else:
            return batch

where get_early_stop returns a Boolean indicating whether training should stop early at any given time.
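For context, a minimal file-based implementation of such a helper might look like the sketch below. This is hypothetical: the thread never shows the real get_early_stop, so the flag-file mechanism is an assumption.

```python
import os

def get_early_stop(early_stop_path: str) -> bool:
    """Hypothetical helper: treat the existence of a flag file as the
    'stop training now' signal, so a run can be stopped externally by
    creating the file (e.g. `touch stop.flag`)."""
    return os.path.exists(early_stop_path)
```

A hook can then poll this once per batch and return -1 when the flag appears.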

For an unknown reason, the args and kwargs in the line output = hook_fx(*args, **kwargs) are empty.

If I remove the on_batch_start method, the code runs further but crashes elsewhere; see the next comment.

If you need further information, let me know and I'll try to help.

Regarding other problems with this version:

I get either a crashing error or a warning, and the two messages suggest mutually exclusive fixes.
Error message 1.

 File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 442, in fit
    results = self.accelerator_backend.train()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\accelerators\cpu_backend.py", line 47, in train
    results = self.train_or_test()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\accelerators\base_backend.py", line 43, in train_or_test
    results = self.trainer.train()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 489, in train
    self.train_loop.run_training_epoch()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 539, in run_training_epoch
    self.trainer.run_evaluation(test_mode=False)
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 604, in run_evaluation
    self.evaluation_loop.on_evaluation_epoch_end()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\evaluation_loop.py", line 298, in on_evaluation_epoch_end
    self.trainer.call_hook('on_validation_epoch_end', *args, **kwargs)
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 800, in call_hook
    trainer_hook(*args, **kwargs)
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\callback_hook.py", line 87, in on_validation_epoch_end
    callback.on_validation_epoch_end(self, self.get_model())
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\callbacks\early_stopping.py", line 152, in on_validation_epoch_end
    if self._validate_condition_metric(trainer.logger_connector.callback_metrics):
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\callbacks\early_stopping.py", line 116, in _validate_condition_metric
    raise RuntimeError(error_msg)
RuntimeError: Early stopping conditioned on metric `val_loss` which is not available. Either add `val_loss` to the return of `validation_epoch_end` or modify your `EarlyStopping` callback to use any of the following: ``

Note the part "Either add val_loss to the return of validation_epoch_end", and that the error message is cut off, with nothing after "the following:".

Warning message 2.

UserWarning: The validation_epoch_end should not return anything as of 9.1.to log, use self.log(...) or self.write(...) directly in the LightningModule

This appears after I remove return {'val_loss': loss}, leaving just self.log("val_loss", loss). So which should I do: return or not return?

Okay. Regarding the first issue, on_batch_start is deprecated in 0.9 and will be removed in 1.0.
https://pytorch-lightning.readthedocs.io/en/latest/api/pytorch_lightning.core.hooks.html#pytorch_lightning.core.hooks.ModelHooks.on_batch_start

Please use on_train_batch_start.
https://pytorch-lightning.readthedocs.io/en/latest/api/pytorch_lightning.core.hooks.html#pytorch_lightning.core.hooks.ModelHooks.on_train_batch_start

Also, I assume you don't want to stop the training epoch right at the start of training.

    def on_train_batch_start(self, batch, batch_idx, dataloader_idx):
        if self.get_early_stop(self.hparams['early_stop_path']):
            return -1
        else:
            return batch

The method you provided works without any errors. Thank you for the advice.

Regarding the second issue, you can use self.log('val_loss', loss) to use val_loss in the early stop callback.
For now, you can ignore the warning below; a fix is in progress in #3812.

UserWarning: The validation_epoch_end should not return anything as of 9.1.to log, use self.log(...) or self.write(...) directly in the LightningModule
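The metric flow this answer describes can be modeled with a toy stand-in (illustrative names only, not Lightning's real implementation): self.log writes into callback_metrics, which is where the EarlyStopping callback looks up its monitored key, so no return value from validation_epoch_end is needed.

```python
# Toy model of why self.log("val_loss", ...) satisfies EarlyStopping.
# Names mirror Lightning's concepts but this is NOT its actual code.

class ToyModule:
    def __init__(self):
        self.callback_metrics = {}

    def log(self, name, value):
        # Lightning's self.log ultimately populates callback_metrics.
        self.callback_metrics[name] = value

def early_stopping_metric_available(module, monitor="val_loss"):
    # Mirrors the failing check in early_stopping.py: the monitored
    # metric must be present among the callback metrics.
    return monitor in module.callback_metrics

m = ToyModule()
m.log("val_loss", 0.42)  # what validation_step's self.log call does
print(early_stopping_metric_available(m))  # prints True
```

If the metric is never logged, the check fails and you get the "conditioned on metric `val_loss` which is not available" RuntimeError shown above.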

@ydcjeff, I'll report back on that later. It's 9 PM my time. Thank you.

Okay. Feel free to create another issue if something doesn't work with early stopping.

Forgot to report. The code you suggested works. Thank you again, @ydcjeff.
