Pytorch-lightning: Missing attribute "training_step_output_for_epoch_end"

Created on 28 Sep 2020 · 12 comments · Source: PyTorchLightning/pytorch-lightning

I used the documentation way of stopping the training (https://pytorch-lightning.readthedocs.io/en/latest/early_stopping.html#enable-early-stopping-using-callbacks-on-epoch-end).

If the on_batch_start method returns -1 at the very beginning of an epoch, the AttributeError named in the title is raised.
The problem is in training_loop.py line 496 (batch_output.training_step_output_for_epoch_end).

Code sample

Use the method and run your code:

    def on_batch_start(self, batch):
        return -1

Expected behavior

Check whether batch_output equals -1 before running training_loop.py line 495.
Early stopping implemented the way the documentation specifies should not throw an exception; it should simply stop the training.
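The proposed check can be modeled with a minimal standalone sketch. All names here are illustrative stand-ins, not Lightning's actual internals: when a hook returns -1, the batch run short-circuits and yields the sentinel -1 instead of a result object, so the epoch loop must test for the sentinel before touching any attribute.

```python
# Toy model of the bug and the proposed guard (illustrative names only).

class BatchOutput:
    """Stand-in for the per-batch result object Lightning builds."""
    def __init__(self, value):
        self.training_step_output_for_epoch_end = value

def run_training_batch(stop: bool):
    # Mimics Lightning's behavior: if on_batch_start returns -1, the
    # whole batch run short-circuits and returns the sentinel -1.
    if stop:
        return -1
    return BatchOutput("step-output")

def run_training_epoch(stop: bool):
    epoch_output = []
    for _ in range(3):
        batch_output = run_training_batch(stop)
        if batch_output == -1:
            # The proposed guard: stop the epoch instead of accessing
            # an attribute that the int sentinel does not have.
            break
        epoch_output.append(batch_output.training_step_output_for_epoch_end)
    return epoch_output
```

Without the `batch_output == -1` guard, the final line would raise the reported `AttributeError` whenever the hook signals a stop.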

Environment

  • CUDA:

    • GPU:

    • available: False

    • version: None

  • Packages:

    • numpy: 1.19.1

    • pyTorch_debug: False

    • pyTorch_version: 1.6.0

    • pytorch-lightning: 0.9.0

    • tqdm: 4.49.0

  • System:

    • OS: Windows

    • architecture:



      • 64bit


      • WindowsPE



    • processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel

    • python: 3.8.5

    • version: 10.0.18362

bug / fix help wanted


All 12 comments

Hi! Thanks for your contribution, great first issue!

@chrismaliszewski can you confirm this now stops the training epoch?

Should I update via the conda command line? Nothing has changed:

Traceback (most recent call last):
  File "XXX\__main__train.py", line 54, in <module>
    trainer.fit(model)
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1084, in fit
    results = self.accelerator_backend.train(model)
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\accelerators\cpu_backend.py", line 39, in train
    results = self.trainer.run_pretrain_routine(model)
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1239, in run_pretrain_routine
    self.train()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 394, in train
    self.run_training_epoch()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 496, in run_training_epoch
    batch_output.training_step_output_for_epoch_end,
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\utilities\parsing.py", line 144, in __getattr__
    raise AttributeError(f'Missing attribute "{key}"')
AttributeError: Missing attribute "training_step_output_for_epoch_end"
Epoch 0:   0%|          | 0/4 [00:00<?, ?it/s]

Process finished with exit code 1

Or should I update directly from GitHub, i.e. using the method provided here: https://stackoverflow.com/questions/19042389/conda-installing-upgrading-directly-from-github?

Yes please. It's fixed on master but hasn't been released yet. You can do the following in your conda or pip env:

pip install git+https://github.com/PyTorchLightning/pytorch-lightning.git@master --upgrade

After the update you suggested I have MAJOR problems, including crashing errors.

Regarding the issue I posted, I get the following error no matter whether I return -1 or anything else:

File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 442, in fit
    results = self.accelerator_backend.train()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\accelerators\cpu_backend.py", line 47, in train
    results = self.train_or_test()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\accelerators\base_backend.py", line 43, in train_or_test
    results = self.trainer.train()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 489, in train
    self.train_loop.run_training_epoch()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 516, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 617, in run_training_batch
    response = self.trainer.call_hook('on_batch_start')
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 807, in call_hook
    output = hook_fx(*args, **kwargs)
TypeError: on_batch_start() missing 1 required positional argument: 'batch'
Epoch 0:   0%|          | 0/4 [00:00<?, ?it/s]

I haven't changed anything in the definition of my function and it looks as follows:

    def on_batch_start(self, batch):
        return -1  # unconditional return added for testing; the code below is unreachable
        if self.get_early_stop(self.hparams['early_stop_path']):
            return -1
        else:
            return batch

where get_early_stop returns a Boolean indicating whether training should stop early at any given time.
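For context, a minimal file-based implementation of such a helper might look like the sketch below. This is hypothetical: the thread never shows the real get_early_stop, so the flag-file mechanism is an assumption.

```python
import os

def get_early_stop(early_stop_path: str) -> bool:
    """Hypothetical helper: treat the existence of a flag file as the
    'stop training now' signal, so a run can be stopped externally by
    creating the file (e.g. `touch stop.flag`)."""
    return os.path.exists(early_stop_path)
```

A hook can then poll this once per batch and return -1 when the flag appears.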

For an unknown reason, the args and kwargs in the line output = hook_fx(*args, **kwargs) are empty.

If I remove the on_batch_start method, the code runs further but crashes elsewhere; see the next comment.

If you need further information, let me know and I'll try to help.

Regarding other problems with this version:

I get either a crashing error or a warning, and the two messages suggest mutually exclusive fixes.
Error message 1.

 File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 442, in fit
    results = self.accelerator_backend.train()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\accelerators\cpu_backend.py", line 47, in train
    results = self.train_or_test()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\accelerators\base_backend.py", line 43, in train_or_test
    results = self.trainer.train()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 489, in train
    self.train_loop.run_training_epoch()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 539, in run_training_epoch
    self.trainer.run_evaluation(test_mode=False)
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 604, in run_evaluation
    self.evaluation_loop.on_evaluation_epoch_end()
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\evaluation_loop.py", line 298, in on_evaluation_epoch_end
    self.trainer.call_hook('on_validation_epoch_end', *args, **kwargs)
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 800, in call_hook
    trainer_hook(*args, **kwargs)
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\callback_hook.py", line 87, in on_validation_epoch_end
    callback.on_validation_epoch_end(self, self.get_model())
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\callbacks\early_stopping.py", line 152, in on_validation_epoch_end
    if self._validate_condition_metric(trainer.logger_connector.callback_metrics):
  File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\callbacks\early_stopping.py", line 116, in _validate_condition_metric
    raise RuntimeError(error_msg)
RuntimeError: Early stopping conditioned on metric `val_loss` which is not available. Either add `val_loss` to the return of `validation_epoch_end` or modify your `EarlyStopping` callback to use any of the following: ``

Note the part "Either add val_loss to the return of validation_epoch_end", and that the error message is cut off, with nothing after "the following:".

Warning message 2.

UserWarning: The validation_epoch_end should not return anything as of 9.1.to log, use self.log(...) or self.write(...) directly in the LightningModule

This appears after I remove return {'val_loss': loss}, leaving just self.log("val_loss", loss). So which should I do: return or not return?

Okay. Regarding the first issue, on_batch_start is deprecated in 0.9 and will be removed in 1.0.
https://pytorch-lightning.readthedocs.io/en/latest/api/pytorch_lightning.core.hooks.html#pytorch_lightning.core.hooks.ModelHooks.on_batch_start

Please use on_train_batch_start.
https://pytorch-lightning.readthedocs.io/en/latest/api/pytorch_lightning.core.hooks.html#pytorch_lightning.core.hooks.ModelHooks.on_train_batch_start

Also, I assume you don't want to stop the training epoch right at the start of training.

    def on_train_batch_start(self, batch, batch_idx, dataloader_idx):
        if self.get_early_stop(self.hparams['early_stop_path']):
            return -1
        else:
            return batch

The method you provided works without any errors. Thank you for the advice.

Regarding the second issue, you can use self.log('val_loss', loss) to use val_loss in the early stop callback.
For now, you can ignore the warning below; a fix is in progress in #3812.

UserWarning: The validation_epoch_end should not return anything as of 9.1.to log, use self.log(...) or self.write(...) directly in the LightningModule
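The metric flow this answer describes can be modeled with a toy stand-in (illustrative names only, not Lightning's real implementation): self.log writes into callback_metrics, which is where the EarlyStopping callback looks up its monitored key, so no return value from validation_epoch_end is needed.

```python
# Toy model of why self.log("val_loss", ...) satisfies EarlyStopping.
# Names mirror Lightning's concepts but this is NOT its actual code.

class ToyModule:
    def __init__(self):
        self.callback_metrics = {}

    def log(self, name, value):
        # Lightning's self.log ultimately populates callback_metrics.
        self.callback_metrics[name] = value

def early_stopping_metric_available(module, monitor="val_loss"):
    # Mirrors the failing check in early_stopping.py: the monitored
    # metric must be present among the callback metrics.
    return monitor in module.callback_metrics

m = ToyModule()
m.log("val_loss", 0.42)  # what validation_step's self.log call does
print(early_stopping_metric_available(m))  # prints True
```

If the metric is never logged, the check fails and you get the "conditioned on metric `val_loss` which is not available" RuntimeError shown above.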

@ydcjeff, I'll report back on that later. It's 9 PM my time. Thank you.

Okay. Feel free to create another issue if something doesn't work with early stopping.

Forgot to report. The code you suggested works. Thank you again, @ydcjeff.
