Describe the bug
Training stops when setting val_check_interval<1.0 in the Trainer class as it doesn't recognise val_loss. I get the following warning at the end of the 3rd epoch:
Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss,train_loss
To Reproduce
Steps to reproduce the behavior:
In the default MNIST example from the README, change the trainer line to set val_check_interval<1.0.
Expected behavior
Training shouldn't stop and val_loss should be recognised.
Additional context
This doesn't happen with 0.5.2.1 although it looks like something has changed with model saving mechanism since it only seems to save the best model in 0.5.3.2.
EDIT: Also seems to happen when setting train_percent_check<1.0
can you post your test_end step?
I didn't use a test set since it is optional. The default MNIST example in the README will reproduce the behaviour when changing the trainer line to:
```python
trainer = Trainer(val_check_interval=0.5, default_save_path="log_dir")
trainer = Trainer(train_percent_check=0.5, default_save_path="log_dir")
```
sorry, meant validation_end
```python
def validation_end(self, outputs):
    # OPTIONAL
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
    tensorboard_logs = {'val_loss': avg_loss}
    return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}
```
I tried changing 'avg_val_loss' -> 'val_loss' but the same issue occurs.
it should be val_loss
I tried it with val_loss too.
```python
def validation_end(self, outputs):
    # OPTIONAL
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
    tensorboard_logs = {'val_loss': avg_loss}
    return {'val_loss': avg_loss, 'log': tensorboard_logs}
```
The issue still occurs.
The issue only goes away when using the default val_check_interval and train_percent_check in the Trainer.
ok got it. can you share the stacktrace?
There is no error, just a warning at the end of epoch 3, and then training stops.
```
Epoch 3: : 1894batch [00:04, 403.95batch/s, batch_nb=18, loss=1.014, v_nb=0]
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/callbacks/pt_callbacks.py:128: RuntimeWarning: Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss,train_loss
  RuntimeWarning)
```
It looks like the problem is that there is only one self.callback_metrics dict, which is sometimes overwritten by self.run_training_batch and sometimes by self.run_evaluation. At the same time, the early stopping callback reads self.callback_metrics at the end of the training epoch. The problem is that the last training batch may not be followed by a validation run; in that case self.callback_metrics will contain only the metrics from the last training batch.
If that is true, we could just force a validation run at the end of the training epoch.
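A simplified sketch of this interaction (illustrative class and numbers only, not the real Trainer internals):

```python
# Simplified sketch of the suspected interaction; not the real Trainer code.
class TrainerSketch:
    def __init__(self):
        self.callback_metrics = {}

    def run_training_batch(self):
        # Training overwrites callback_metrics with train-only keys.
        self.callback_metrics = {'loss': 1.014, 'train_loss': 1.014}

    def run_evaluation(self):
        # Validation overwrites them again with val keys.
        self.callback_metrics = {'val_loss': 0.9}

    def train_epoch(self, num_batches, val_check_interval):
        check_every = max(1, int(num_batches * val_check_interval))
        for i in range(num_batches):
            self.run_training_batch()
            if (i + 1) % check_every == 0:
                self.run_evaluation()
        # If the last batch did not trigger run_evaluation(), callback_metrics
        # now holds only the train metrics, and an early-stopping check on
        # `val_loss` at epoch end fails.
        return self.callback_metrics


print(TrainerSketch().train_epoch(num_batches=10, val_check_interval=0.3))
# -> {'loss': 1.014, 'train_loss': 1.014}  (val_loss was overwritten)
```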
@kuynzereb we shouldn't force computation. just partition self.callback_metrics to have
```python
self.callback_metrics['train'] = {}
self.callback_metrics['val'] = {}
self.callback_metrics['test'] = {}
```
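A minimal sketch of what that partitioning could look like (hypothetical layout, following the snippet above):

```python
# Hypothetical partitioned metrics store; illustrative only.
callback_metrics = {
    'train': {'loss': 1.014, 'train_loss': 1.014},
    'val': {'val_loss': 0.9},
    'test': {},
}

# The early-stopping check would then read only the validation bucket,
# which keeps its last value even when the final training batch of the
# epoch rewrites the 'train' entries.
monitor = 'val_loss'
monitor_val = callback_metrics['val'].get(monitor)
print(monitor_val)  # 0.9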
anyone interested in the PR?
I created PR #492, but made a simpler change that updates self.callback_metrics instead, since that doesn't require changes to the EarlyStopping callback. It also seems more consistent with how the other logging metrics are updated.
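The idea, roughly sketched (illustrative only, not the actual diff in #492):

```python
# Illustrative only: updating vs. replacing the shared metrics dict.
callback_metrics = {'val_loss': 0.9}  # left over from the last validation run
train_metrics = {'loss': 1.014, 'train_loss': 1.014}

# Replacing the dict drops val_loss:
#   callback_metrics = train_metrics

# Updating it keeps val_loss, so the EarlyStopping callback still sees it:
callback_metrics.update(train_metrics)
assert 'val_loss' in callback_metrics
```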
@williamFalcon @ryanwongsa
Has this issue been fixed?
I'm facing the same issue after changing my local package along with #492.
BTW, I don't have any validation loops and early stopping callbacks.
FYI, I was still having this issue, which I traced to trainer.overfit_pc being too small relative to my batch size and number of GPUs. The validation sanity checks and validation_end seemed to get skipped (when I ran without early stopping), so my loss metrics dict was never returned. It was solved purely by increasing overfit_pc.
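One way a small overfit fraction can make validation disappear entirely, as a back-of-the-envelope sketch (illustrative numbers; the exact rounding inside the Trainer may differ):

```python
# If the requested fraction of validation batches rounds down to zero,
# the validation loop never runs and no val metrics are produced.
num_val_batches = 10   # e.g. small dataset, large batch size, several GPUs
overfit_pc = 0.05

batches_to_run = int(num_val_batches * overfit_pc)  # int(0.5) == 0
print(batches_to_run)  # 0 -> validation_end is never called
```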
I encountered the same problem when I set check_val_every_n_epoch>1 in the Trainer.
Validation then only runs every few training epochs, but the following check already runs at the end of the first training epoch:
```python
def check_metrics(self, logs):
    monitor_val = logs.get(self.monitor)
    error_msg = (f'Early stopping conditioned on metric `{self.monitor}`'
                 f' which is not available. Available metrics are:'
                 f' `{"`, `".join(list(logs.keys()))}`')

    if monitor_val is None:
        if self.strict:
            raise RuntimeError(error_msg)
        if self.verbose > 0:
            rank_zero_warn(error_msg, RuntimeWarning)
        return False

    return True
```
If the strict parameter is set (the default), the trainer terminates with that exception.
I think the problem is that EarlyStopping checks for the presence of the validation metric in on_epoch_end rather than in on_validation_end. This is not a problem if validation runs at the end of every epoch, but it becomes one as soon as validation runs less often. One could set strict to False, but I think users should still get a warning if the validation metric they try to monitor is not present after a validation run.
Instead, a good solution is to make EarlyStopping use on_validation_end instead of on_epoch_end. I believe this was the intention of EarlyStopping from the beginning. I'm opening a PR to discuss this quick fix.
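A minimal sketch of that proposal (a hypothetical callback, not the final PR implementation): the missing-metric check only fires after an actual validation run, while on_epoch_end stays silent. As a stopgap, setting strict=False (as shown in check_metrics above) downgrades the error to a warning.

```python
import warnings


# Hypothetical callback illustrating the proposal: the metric check moves
# from on_epoch_end to on_validation_end, so epochs without a validation
# run can no longer trigger the missing-metric error.
class ValidationEndEarlyStopping:
    def __init__(self, monitor='val_loss', strict=True, verbose=True):
        self.monitor = monitor
        self.strict = strict
        self.verbose = verbose

    def on_epoch_end(self, metrics):
        # Intentionally a no-op for the metric check.
        pass

    def on_validation_end(self, metrics):
        value = metrics.get(self.monitor)
        if value is None:
            msg = (f'Early stopping conditioned on metric `{self.monitor}` '
                   f'which is not available after a validation run.')
            if self.strict:
                raise RuntimeError(msg)
            if self.verbose:
                warnings.warn(msg, RuntimeWarning)
            return False
        # ... compare `value` with the best value seen so far and decide
        #     whether to stop training ...
        return True
```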