Pytorch-lightning: default EarlyStopping callback should not fail on missing val_loss data

Created on 18 Nov 2019 · 14 comments · Source: PyTorchLightning/pytorch-lightning

Describe the bug
My training script failed overnight — this is the last thing I see in the logs before the instance shut down:

python3.7/site-packages/pytorch_lightning/callbacks/pt_callbacks.py:128: RuntimeWarning: Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: avg_val_loss_total,avg_val_jacc10,avg_val_ce
  RuntimeWarning)

It looks like this was intended as a warning, but it appears to have interrupted my training script. Do you think that's possible, or could it be something else? I had 2 training scripts running on 2 different instances last night, and both shut down this way, with this RuntimeWarning as the last line in the logs. Is it possible that the default EarlyStopping callback killed my script because I didn't log a val_loss tensor somewhere it could find it? To be clear, I don't intend to use EarlyStopping at all, so I was quite surprised to wake up today to find my instance shut down, training interrupted, and no clear sign of a bug on my end. Did you intend for this to interrupt the trainer? If so, how do we feel about changing that so the default EarlyStopping callback has no effect when it can't find a val_loss metric?
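
(A possible interim workaround, assuming this Lightning version exposes the early_stop_callback argument on Trainer, would be to disable the default callback explicitly. A minimal sketch under that assumption:)

    from pytorch_lightning import Trainer

    # Assumption: in this Lightning version the default EarlyStopping is controlled by
    # the `early_stop_callback` Trainer argument; passing False should disable it entirely.
    trainer = Trainer(early_stop_callback=False)
    trainer.fit(model)  # `model` is your LightningModule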

bug / fix

All 14 comments

I agree that it is quite unpleasant when the default early stopping callback unexpectedly stops training because it can't find val_loss. It is also unpleasant that you only find out the required metric is missing at the end of the first full training epoch (and, moreover, training then stops). So I would separate this into two different problems:
1) Default early stopping should not stop training. We should either disable it when no val_loss is found, or simply disable it by default altogether.
2) We should check at the very beginning of training that the metric required by early stopping is available after the validation loop. Right now it is checked only at the end of the first training epoch, and if it is not present, training stops.
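
Roughly what I have in mind for (1), as a standalone sketch. The on_epoch_end shape and attribute names are assumptions modelled on the Keras-style callback in pt_callbacks.py, not the actual Lightning code:

    import warnings

    class SkippingEarlyStopping:
        """Sketch of proposal (1): act as a no-op when the monitored metric is absent."""

        def __init__(self, monitor='val_loss', min_delta=0.0, patience=3):
            self.monitor = monitor
            self.min_delta = min_delta
            self.patience = patience
            self.wait = 0
            self.best = float('inf')  # assumes 'min' mode, i.e. a lower val_loss is better

        def on_epoch_end(self, epoch, logs=None):
            """Return True only when training should stop."""
            logs = logs or {}
            if self.monitor not in logs:
                warnings.warn(
                    f'Early stopping conditioned on metric `{self.monitor}` which is not '
                    'available. Skipping the early-stopping check for this epoch.',
                    RuntimeWarning,
                )
                return False  # missing metric: warn and carry on, never stop training

            current = logs[self.monitor]
            if current < self.best - self.min_delta:
                self.best = current
                self.wait = 0
                return False
            self.wait += 1
            return self.wait >= self.patience  # stop only after `patience` bad epochs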

I'm guessing you were running with check_val_every_n_epoch > 1.
This error happens because callback_metrics is what early stopping reads from, and it is cleared and re-filled at every training-step logging call. A hacky solution I have found is to save the last val_loss as a model attribute, self.val_loss, and return it from every training step, e.g.:

    return {
        'loss': loss,
        'log': log_dict,
        'progress_bar': prog_dict,
        'val_loss': self.val_loss
    }
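
For completeness, a sketch of where self.val_loss would come from in this workaround, assuming the validation_end hook of that Lightning version (the hook name and output keys are assumptions):

    import torch

    def validation_end(self, outputs):
        # Cache the aggregated validation loss on the module so that training_step
        # can keep re-publishing it under 'val_loss' in the callback metrics.
        avg_val_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        self.val_loss = avg_val_loss
        return {'val_loss': avg_val_loss, 'log': {'val_loss': avg_val_loss}}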

Wow, indeed, there is a third problem:
3) It is not clear how early stopping should work when check_val_every_n_epoch > 1.

However, please note that callback metrics are now no longer replaced by new ones but updated; this was fixed in #492.

So I would suggest the following:

  1. By default the early stop callback is turned on, but if there is no val_loss we just warn the user that early stopping will not work, and training proceeds as though there were no early stop callback.
  2. If the early stop callback is explicitly specified by the user, then we force the validation sanity check and examine the metrics obtained from it. If the metric required by the early stop callback is not present, we raise an error (see the sketch below).
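
For (2), a rough sketch of what the check after the validation sanity check could look like. The function and attribute names here are illustrative, not actual Lightning internals:

    def check_early_stop_metric(callback_metrics, early_stop_callback):
        # Sketch of proposal (2): run right after the validation sanity check, before
        # any real training, so a missing metric fails fast instead of after a full epoch.
        if early_stop_callback is None:
            return
        monitor = early_stop_callback.monitor  # e.g. 'val_loss'
        if monitor not in callback_metrics:
            available = ', '.join(callback_metrics) or 'none'
            raise RuntimeError(
                f'Early stopping was requested on `{monitor}`, but the validation loop '
                f'only produced: {available}. Return `{monitor}` from validation or '
                'configure the early stop callback with an available metric.'
            )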

@williamFalcon, what do you think?

Isn't it possible that the user returns val_loss only in some epochs, e.g., only every other epoch (intentionally or not)?

Yeah, it is a problem if no val_loss is returned in some epochs; in that case early stopping will behave quite strangely. It happens, for example, if check_val_every_n_epoch > 1.

@awaelchli ... maybe modify the early stopping to skip the check when that key is missing?

very reasonable imo. @kuynzereb do you see any problem with this?

Nope, it sounds good to me too. But we will need to explicitly remove this key from the callback metrics at the start of each epoch, otherwise it will always be available (right now it always stores the metric from the last validation loop).
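
Something like this at the start of every training epoch would keep a stale value from satisfying the check (just a sketch; callback_metrics is assumed to behave like a plain dict):

    # Drop the stale value so the early-stopping check only ever sees a val_loss
    # that was actually produced during the current epoch.
    callback_metrics.pop('val_loss', None)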

@awaelchli or @kuynzereb mind submitting a PR?

I can look into it.

@awaelchli any updates?

Lost track of this after I ran into some unexpected behaviors. Will try to get back to it, but it seems @kuynzereb has a better overview of early stopping than me.

It seems that we can just add a condition that early_stop_callback.on_epoch_end() should be called only if current_epoch % check_val_every_n_epoch == 0
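
A sketch of that condition pulled out of the trainer's epoch loop (the names and the on_epoch_end signature are assumptions, not the actual Lightning internals):

    def maybe_run_early_stopping(current_epoch, check_val_every_n_epoch,
                                 early_stop_callback, callback_metrics):
        """Return True if training should stop.

        Only consults early stopping on epochs where the validation loop actually ran,
        so check_val_every_n_epoch > 1 no longer trips the missing-metric path.
        """
        if early_stop_callback is None:
            return False
        if current_epoch % check_val_every_n_epoch != 0:
            return False
        return early_stop_callback.on_epoch_end(epoch=current_epoch, logs=callback_metrics)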
