Pytorch-lightning: EarlyStopping not working / wrong keys in log

Created on 3 Sep 2020 · 6Comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

I’m trying to implement EarlyStopping when validation loss stops decreasing. I add callback as follows:

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.l1_loss(y_hat, y)
    result = pl.EvalResult(checkpoint_on=loss)
    result.log("val_loss", loss, sync_dist=True)
    return result

early_stop_callback = EarlyStopping(
    monitor="val_loss",
    min_delta=0.1,
    patience=1,
    verbose=True,
    mode="min")

trainer = pl.Trainer(
    gpus=-1,
    max_epochs=50,
    distributed_backend="ddp",
    early_stop_callback=early_stop_callback,
    logger=wandb_logger)

This does not work - it is returning False at from the _validate_condition_metric function
When I checked what’s in the log dictionary, the values looked like
{'val_early_stop_on': None, 'val_checkpoint_on': tensor(0.5601, device='cuda:0')} - which is slightly confusing. Where does “val_checkpoint_on” come from and why it is not called “val_loss”?

It feels like it might be slightly connected to the result = pl.EvalResult(checkpoint_on=loss) line.
I was reading documentation, but frankly speaking I found
checkpoint_on (Union[Tensor, bool, None]) – Metric to checkpoint on. to be slightly not intuitive. What does it mean for the metric to be checkpoints on? And does it really connect to keys in log being renamed in a strange way?

Code sample

https://github.com/matsuokalab/cosmoflow/blob/ac75fe317f8daf3444c96b837bb109064aa81dab/main.py

Expected behavior

Expecting EarlyStopping to work, log to have val_loss key

Environment

* CUDA:
    - GPU:
        - Tesla V100-SXM2-16GB
        - Tesla V100-SXM2-16GB
        - Tesla V100-SXM2-16GB
        - Tesla V100-SXM2-16GB
    - available:         True
    - version:           10.2
* Packages:
    - numpy:             1.19.1
    - pyTorch_debug:     False
    - pyTorch_version:   1.6.0
    - pytorch-lightning: 0.9.0
    - tensorboard:       2.2.0
    - tqdm:              4.46.1
* System:
    - OS:                Linux
    - architecture:
        - 64bit
        - ELF
    - processor:         x86_64
    - python:            3.8.2
    - version:           #1 SMP Fri Apr 20 16:44:24 UTC 2018

ResultObj bug / fix help wanted

Source

undertherain

Most helpful comment

Aha, I saw early_stop_on=loss, but thought it's some sort of automated way without callback. It is a bit counter-intuitive still that monitor is not working, but at least the stopping works now, thanks!

undertherain on 3 Sep 2020

🎉2

All 6 comments

Hi! thanks for your contribution!, great first issue!

github-actions[bot] on 3 Sep 2020

Hi @undertherain, currently the monitor key of EarlyStopping will have no effect when you use along with EvalResult or TrainResult.
So, you need to provide early_stop_on with the loss (metric) in validation_step if using with EvalResult

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.l1_loss(y_hat, y)
    result = pl.EvalResult(checkpoint_on=loss, early_stop_on=loss)  # changes here
    result.log("val_loss", loss, sync_dist=True)
    return result

Yes, val_early_stop_on and val_checkpoint_on are from EvalResult.
https://github.com/PyTorchLightning/pytorch-lightning/blob/3910ad033074367f6abfe0001562db725a75cb73/pytorch_lightning/core/step_result.py#L786

~~While using with Result object, the val_loss you logged will be only for logging to the logger, I guess.~~

API docs: https://pytorch-lightning.readthedocs.io/en/latest/results.html#evalresult-api

ydcjeff on 3 Sep 2020

undertherain on 3 Sep 2020

🎉2

I guess can close this, if it is considered "not a bug, but a feature" :)

undertherain on 3 Sep 2020

👍1

Aha, I saw early_stop_on=loss, but thought it's some sort of automated way without callback. It is a bit counter-intuitive still that monitor is not working, but at least the stopping works now, thanks!

Glad to hear that

we also have some discussions going on #3286

ydcjeff on 3 Sep 2020

closing it ok? since you found a workaround and ignored monitor is a known issue and discussion happens in other issue as pointed out by @ydcjeff