I used the documentation way of stopping the training (https://pytorch-lightning.readthedocs.io/en/latest/early_stopping.html#enable-early-stopping-using-callbacks-on-epoch-end).
If the on_batch_start method returns -1 at the very beginning of an epoch, the titled AttributeError exception is raised.
The problem is in training_loop.py line 496 (batch_output.training_step_output_for_epoch_end).
Use the method and run your code:
def on_batch_start(self, batch):
    return -1
Check whether batch_output equals -1 before running training_loop.py line 495.
An early stopping method implemented the way the documentation specifies should not throw an exception but rather simply stop the training.
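The suggested guard can be sketched as follows. This is a simplified stand-in for the loop in training_loop.py, not the actual Lightning code; `run_training_batch` and the batch list are hypothetical placeholders:

```python
def run_training_epoch(batches, run_training_batch):
    """Simplified sketch of an epoch loop with the suggested -1 guard."""
    epoch_output = []
    for batch in batches:
        batch_output = run_training_batch(batch)
        # If a hook (e.g. on_batch_start) requested early stopping, the loop
        # receives -1 instead of a result object, so stop cleanly here rather
        # than accessing training_step_output_for_epoch_end on it.
        if batch_output == -1:
            break
        epoch_output.append(batch_output)
    return epoch_output
```

With this check in place, a -1 from the hook ends the epoch instead of raising an AttributeError.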
Hi! Thanks for your contribution, great first issue!
@chrismaliszewski can you confirm this now stops the training epoch?
Should I update via the conda command line? Nothing has changed:
Traceback (most recent call last):
File "XXX\__main__train.py", line 54, in <module>
trainer.fit(model)
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1084, in fit
results = self.accelerator_backend.train(model)
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\accelerators\cpu_backend.py", line 39, in train
results = self.trainer.run_pretrain_routine(model)
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1239, in run_pretrain_routine
self.train()
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 394, in train
self.run_training_epoch()
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 496, in run_training_epoch
batch_output.training_step_output_for_epoch_end,
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\utilities\parsing.py", line 144, in __getattr__
raise AttributeError(f'Missing attribute "{key}"')
AttributeError: Missing attribute "training_step_output_for_epoch_end"
Epoch 0: 0%| | 0/4 [00:00<?, ?it/s]
Process finished with exit code 1
Or should I update directly from GitHub, i.e. using the method provided here: https://stackoverflow.com/questions/19042389/conda-installing-upgrading-directly-from-github?
Yes please, it's fixed on master but hasn't been released yet. You can do the following in your conda or pip env:
pip install git+https://github.com/PyTorchLightning/pytorch-lightning.git@master --upgrade
After the update you suggested I have MAJOR problems, even crashing errors.
Regarding the issue I posted, I get the following error no matter whether I return -1 or anything else:
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 442, in fit
results = self.accelerator_backend.train()
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\accelerators\cpu_backend.py", line 47, in train
results = self.train_or_test()
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\accelerators\base_backend.py", line 43, in train_or_test
results = self.trainer.train()
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 489, in train
self.train_loop.run_training_epoch()
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 516, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 617, in run_training_batch
response = self.trainer.call_hook('on_batch_start')
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 807, in call_hook
output = hook_fx(*args, **kwargs)
TypeError: on_batch_start() missing 1 required positional argument: 'batch'
Epoch 0: 0%| | 0/4 [00:00<?, ?it/s]
I haven't changed anything in the definition of my function, and it looks as follows:
def on_batch_start(self, batch):
    if self.get_early_stop(self.hparams['early_stop_path']):
        return -1
    else:
        return batch
where get_early_stop returns a Boolean indicating whether the training should stop early at any given time.
For an unknown reason, the args and kwargs in the line output = hook_fx(*args, **kwargs) are empty.
If I remove the on_batch_start method, the code runs further but crashes elsewhere; read the next comment.
If you need further information, let me know and I'll try to help.
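For context, a helper like get_early_stop might simply check for a flag file on disk so that training can be stopped externally. The actual implementation is not shown in this thread, so the following is only a guess at the intent:

```python
import os

def get_early_stop(early_stop_path):
    """Return True when a flag file exists at the given path.

    Creating the file from another process (or by hand) requests
    that training stop early at the next batch boundary.
    """
    return os.path.exists(early_stop_path)
```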
As for the other problems with this version: I get either a crashing error or a warning, and the two messages suggest mutually exclusive fixes.
Error message 1.
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 442, in fit
results = self.accelerator_backend.train()
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\accelerators\cpu_backend.py", line 47, in train
results = self.train_or_test()
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\accelerators\base_backend.py", line 43, in train_or_test
results = self.trainer.train()
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 489, in train
self.train_loop.run_training_epoch()
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 539, in run_training_epoch
self.trainer.run_evaluation(test_mode=False)
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 604, in run_evaluation
self.evaluation_loop.on_evaluation_epoch_end()
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\evaluation_loop.py", line 298, in on_evaluation_epoch_end
self.trainer.call_hook('on_validation_epoch_end', *args, **kwargs)
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 800, in call_hook
trainer_hook(*args, **kwargs)
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\trainer\callback_hook.py", line 87, in on_validation_epoch_end
callback.on_validation_epoch_end(self, self.get_model())
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\callbacks\early_stopping.py", line 152, in on_validation_epoch_end
if self._validate_condition_metric(trainer.logger_connector.callback_metrics):
File "YYY\anaconda3\envs\pt_cpu\lib\site-packages\pytorch_lightning\callbacks\early_stopping.py", line 116, in _validate_condition_metric
raise RuntimeError(error_msg)
RuntimeError: Early stopping conditioned on metric `val_loss` which is not available. Either add `val_loss` to the return of `validation_epoch_end` or modify your `EarlyStopping` callback to use any of the following: ``
Note the part `Either add val_loss to the return of validation_epoch_end`, and that the error message is cut off, with nothing after `the following:`.
Warning message 2.
UserWarning: The validation_epoch_end should not return anything as of 9.1. To log, use self.log(...) or self.write(...) directly in the LightningModule
after I remove return {'val_loss': loss}, leaving just self.log("val_loss", loss). So which one should I do: return or not return?
Okay. Regarding the first issue, on_batch_start is deprecated in 0.9 and will be removed in 1.0.
https://pytorch-lightning.readthedocs.io/en/latest/api/pytorch_lightning.core.hooks.html#pytorch_lightning.core.hooks.ModelHooks.on_batch_start
Please use on_train_batch_start.
https://pytorch-lightning.readthedocs.io/en/latest/api/pytorch_lightning.core.hooks.html#pytorch_lightning.core.hooks.ModelHooks.on_train_batch_start
Also, you probably don't want to stop the training epoch right at the start of training, I assume.
def on_train_batch_start(self, batch, batch_idx, dataloader_idx):
    if self.get_early_stop(self.hparams['early_stop_path']):
        return -1
    else:
        return batch
The method you provided works without any errors. Thank you for the advice.
Regarding the second issue, you can use self.log('val_loss', loss) to use val_loss in the early stop callback.
For now, you can ignore the warning below; a fix is in progress in #3812:
UserWarning: The validation_epoch_end should not return anything as of 9.1. To log, use self.log(...) or self.write(...) directly in the LightningModule
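The logging pattern being recommended can be illustrated with a minimal stand-in class. This is not a real LightningModule (to keep the sketch self-contained, `log` is stubbed out), and the averaging of outputs is only illustrative; the point is that the metric is logged rather than returned:

```python
class MyModule:
    """Stand-in sketch of a LightningModule using the new logging API."""

    def __init__(self):
        self.logged = {}

    def log(self, name, value):
        # Mimics LightningModule.log: EarlyStopping reads 'val_loss'
        # from the logged callback metrics, not from a return value.
        self.logged[name] = value

    def validation_epoch_end(self, outputs):
        loss = sum(outputs) / len(outputs)
        self.log("val_loss", loss)
        # Return nothing: returning a dict triggers the deprecation warning.
```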
@ydcjeff, I'll report back on that later. It's 9 PM my time. Thank you.
Okay. Feel free to create another issue if something doesn't work with early stopping.
Forgot to report. The code you suggested works. Thank you again, @ydcjeff.