I’m trying to implement EarlyStopping when validation loss stops decreasing. I add callback as follows:
def validation_step(self, batch, batch_idx):
x, y = batch
y_hat = self(x)
loss = F.l1_loss(y_hat, y)
result = pl.EvalResult(checkpoint_on=loss)
result.log("val_loss", loss, sync_dist=True)
return result
early_stop_callback = EarlyStopping(
monitor="val_loss",
min_delta=0.1,
patience=1,
verbose=True,
mode="min")
trainer = pl.Trainer(
gpus=-1,
max_epochs=50,
distributed_backend="ddp",
early_stop_callback=early_stop_callback,
logger=wandb_logger)
This does not work - it is returning False at from the _validate_condition_metric function
When I checked what’s in the log dictionary, the values looked like
{'val_early_stop_on': None, 'val_checkpoint_on': tensor(0.5601, device='cuda:0')} - which is slightly confusing. Where does “val_checkpoint_on” come from and why it is not called “val_loss”?
It feels like it might be slightly connected to the result = pl.EvalResult(checkpoint_on=loss) line.
I was reading documentation, but frankly speaking I found
checkpoint_on (Union[Tensor, bool, None]) – Metric to checkpoint on. to be slightly not intuitive. What does it mean for the metric to be checkpoints on? And does it really connect to keys in log being renamed in a strange way?
https://github.com/matsuokalab/cosmoflow/blob/ac75fe317f8daf3444c96b837bb109064aa81dab/main.py
Expecting EarlyStopping to work, log to have val_loss key
* CUDA:
- GPU:
- Tesla V100-SXM2-16GB
- Tesla V100-SXM2-16GB
- Tesla V100-SXM2-16GB
- Tesla V100-SXM2-16GB
- available: True
- version: 10.2
* Packages:
- numpy: 1.19.1
- pyTorch_debug: False
- pyTorch_version: 1.6.0
- pytorch-lightning: 0.9.0
- tensorboard: 2.2.0
- tqdm: 4.46.1
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.2
- version: #1 SMP Fri Apr 20 16:44:24 UTC 2018
Hi! thanks for your contribution!, great first issue!
Hi @undertherain, currently the monitor key of EarlyStopping will have no effect when you use along with EvalResult or TrainResult.
So, you need to provide early_stop_on with the loss (metric) in validation_step if using with EvalResult
def validation_step(self, batch, batch_idx):
x, y = batch
y_hat = self(x)
loss = F.l1_loss(y_hat, y)
result = pl.EvalResult(checkpoint_on=loss, early_stop_on=loss) # changes here
result.log("val_loss", loss, sync_dist=True)
return result
Yes, val_early_stop_on and val_checkpoint_on are from EvalResult.
https://github.com/PyTorchLightning/pytorch-lightning/blob/3910ad033074367f6abfe0001562db725a75cb73/pytorch_lightning/core/step_result.py#L786
While using with Result object, the val_loss you logged will be only for logging to the logger, I guess.
API docs: https://pytorch-lightning.readthedocs.io/en/latest/results.html#evalresult-api
Aha, I saw early_stop_on=loss, but thought it's some sort of automated way without callback. It is a bit counter-intuitive still that monitor is not working, but at least the stopping works now, thanks!
I guess can close this, if it is considered "not a bug, but a feature" :)
Aha, I saw
early_stop_on=loss, but thought it's some sort of automated way without callback. It is a bit counter-intuitive still thatmonitoris not working, but at least the stopping works now, thanks!
Glad to hear that
we also have some discussions going on #3286
closing it ok? since you found a workaround and ignored monitor is a known issue and discussion happens in other issue as pointed out by @ydcjeff
Most helpful comment
Aha, I saw
early_stop_on=loss, but thought it's some sort of automated way without callback. It is a bit counter-intuitive still thatmonitoris not working, but at least the stopping works now, thanks!