Describe the bug
Training stops when setting val_check_interval<1.0 in the Trainer class as it doesn't recognise val_loss. I get the following warning at the end of the 3rd epoch:
Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss,train_loss
To Reproduce
Steps to reproduce the behavior:
In the default MNIST example from the README, change the trainer line to set val_check_interval<1.0.
Expected behavior
Training shouldn't stop and val_loss should be recognised.
Additional context
This doesn't happen with 0.5.2.1 although it looks like something has changed with model saving mechanism since it only seems to save the best model in 0.5.3.2.
EDIT: Also seems to happen when setting train_percent_check<1.0
can you post your test_end step?
I didn't use a test set since it is optional. The default MNIST example in the README will reproduce the behaviour when changing the trainer line to:
```python
trainer = Trainer(val_check_interval=0.5, default_save_path="log_dir")
trainer = Trainer(train_percent_check=0.5, default_save_path="log_dir")
```
sorry, meant validation_end
```python
def validation_end(self, outputs):
    # OPTIONAL
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
    tensorboard_logs = {'val_loss': avg_loss}
    return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}
```
I tried changing 'avg_val_loss' -> 'val_loss' but the same issue occurs.
it should be val_loss
I tried it with val_loss too.
```python
def validation_end(self, outputs):
    # OPTIONAL
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
    tensorboard_logs = {'val_loss': avg_loss}
    return {'val_loss': avg_loss, 'log': tensorboard_logs}
```
The issue still occurs.
The issue only goes away when using the default val_check_interval and train_percent_check in the Trainer.
ok got it. can you share the stacktrace?
There is no error, just a warning at the end of epoch 3, and then training stops.
```
Epoch 3: : 1894batch [00:04, 403.95batch/s, batch_nb=18, loss=1.014, v_nb=0]
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/callbacks/pt_callbacks.py:128: RuntimeWarning: Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss,train_loss
  RuntimeWarning)
```
It looks like the problem is that there is only one self.callback_metrics dict, which is sometimes overwritten by self.run_training_batch and sometimes by self.run_evaluation. At the same time, the early stopping callback reads self.callback_metrics at the end of the training epoch. The problem is that the last training batch may not be followed by a validation run; in that case self.callback_metrics will contain only the metrics from the last training batch.
If that is true, we could just force a validation run at the end of the training epoch.
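A simplified sketch of this interaction (illustrative class and numbers only, not the real Trainer internals):

```python
# Simplified sketch of the suspected interaction; not the real Trainer code.
class TrainerSketch:
    def __init__(self):
        self.callback_metrics = {}

    def run_training_batch(self):
        # Training overwrites callback_metrics with train-only keys.
        self.callback_metrics = {'loss': 1.014, 'train_loss': 1.014}

    def run_evaluation(self):
        # Validation overwrites them again with val keys.
        self.callback_metrics = {'val_loss': 0.9}

    def train_epoch(self, num_batches, val_check_interval):
        check_every = max(1, int(num_batches * val_check_interval))
        for i in range(num_batches):
            self.run_training_batch()
            if (i + 1) % check_every == 0:
                self.run_evaluation()
        # If the last batch did not trigger run_evaluation(), callback_metrics
        # now holds only the train metrics, and an early-stopping check on
        # `val_loss` at epoch end fails.
        return self.callback_metrics


print(TrainerSketch().train_epoch(num_batches=10, val_check_interval=0.3))
# -> {'loss': 1.014, 'train_loss': 1.014}  (val_loss was overwritten)
```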
@kuynzereb we shouldn't force computation. just partition self.callback_metrics to have
```python
self.callback_metrics['train'] = {}
self.callback_metrics['val'] = {}
self.callback_metrics['test'] = {}
```
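A minimal sketch of what that partitioning could look like (hypothetical layout, following the snippet above):

```python
# Hypothetical partitioned metrics store; illustrative only.
callback_metrics = {
    'train': {'loss': 1.014, 'train_loss': 1.014},
    'val': {'val_loss': 0.9},
    'test': {},
}

# The early-stopping check would then read only the validation bucket,
# which keeps its last value even when the final training batch of the
# epoch rewrites the 'train' entries.
monitor = 'val_loss'
monitor_val = callback_metrics['val'].get(monitor)
print(monitor_val)  # 0.9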
anyone interested in the PR?
I created PR #492, but made a simpler change that updates self.callback_metrics instead, since that doesn't require changes to the EarlyStopping callback. It also seems more consistent with how the other logging metrics are updated.
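The idea, roughly sketched (illustrative only, not the actual diff in #492):

```python
# Illustrative only: updating vs. replacing the shared metrics dict.
callback_metrics = {'val_loss': 0.9}  # left over from the last validation run
train_metrics = {'loss': 1.014, 'train_loss': 1.014}

# Replacing the dict drops val_loss:
#   callback_metrics = train_metrics

# Updating it keeps val_loss, so the EarlyStopping callback still sees it:
callback_metrics.update(train_metrics)
assert 'val_loss' in callback_metrics
```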
@williamFalcon @ryanwongsa
Has this issue been fixed?
I'm facing the same issue after changing my local package along with #492.
BTW, I don't have any validation loops and early stopping callbacks.
FYI, I was still having this issue, which I traced to trainer.overfit_pc being too small relative to my batch size and number of GPUs. The validation sanity checks and validation_end seemed to get skipped (when I ran without early stopping), so my loss metrics dict was never returned. It was solved purely by increasing overfit_pc.
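One way a small overfit fraction can make validation disappear entirely, as a back-of-the-envelope sketch (illustrative numbers; the exact rounding inside the Trainer may differ):

```python
# If the requested fraction of validation batches rounds down to zero,
# the validation loop never runs and no val metrics are produced.
num_val_batches = 10   # e.g. small dataset, large batch size, several GPUs
overfit_pc = 0.05

batches_to_run = int(num_val_batches * overfit_pc)  # int(0.5) == 0
print(batches_to_run)  # 0 -> validation_end is never called
```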
I encountered the same problem when I set check_val_every_n_epoch>1 in the Trainer.
Validation then only runs every few training epochs, but the following check already runs at the end of the first training epoch:
```python
def check_metrics(self, logs):
    monitor_val = logs.get(self.monitor)
    error_msg = (f'Early stopping conditioned on metric `{self.monitor}`'
                 f' which is not available. Available metrics are:'
                 f' `{"`, `".join(list(logs.keys()))}`')

    if monitor_val is None:
        if self.strict:
            raise RuntimeError(error_msg)
        if self.verbose > 0:
            rank_zero_warn(error_msg, RuntimeWarning)
        return False

    return True
```
If the strict parameter is set (the default), the trainer terminates with that exception.
I think the problem is that EarlyStopping checks for the presence of the validation metric in on_epoch_end rather than in on_validation_end. This is not a problem if validation runs at the end of every epoch, but it becomes one as soon as validation runs less often. One could set strict to False, but I think users should still get a warning if the validation metric they try to monitor is not present after a validation run.
Instead, a good solution is to make EarlyStopping use on_validation_end instead of on_epoch_end. I believe this was the intention of EarlyStopping from the beginning. I'm opening a PR to discuss this quick fix.
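A minimal sketch of that proposal (a hypothetical callback, not the final PR implementation): the missing-metric check only fires after an actual validation run, while on_epoch_end stays silent. As a stopgap, setting strict=False (as shown in check_metrics above) downgrades the error to a warning.

```python
import warnings


# Hypothetical callback illustrating the proposal: the metric check moves
# from on_epoch_end to on_validation_end, so epochs without a validation
# run can no longer trigger the missing-metric error.
class ValidationEndEarlyStopping:
    def __init__(self, monitor='val_loss', strict=True, verbose=True):
        self.monitor = monitor
        self.strict = strict
        self.verbose = verbose

    def on_epoch_end(self, metrics):
        # Intentionally a no-op for the metric check.
        pass

    def on_validation_end(self, metrics):
        value = metrics.get(self.monitor)
        if value is None:
            msg = (f'Early stopping conditioned on metric `{self.monitor}` '
                   f'which is not available after a validation run.')
            if self.strict:
                raise RuntimeError(msg)
            if self.verbose:
                warnings.warn(msg, RuntimeWarning)
            return False
        # ... compare `value` with the best value seen so far and decide
        #     whether to stop training ...
        return True
```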