The default checkpoint_callback in Trainer() does not work, so the model's checkpoints are not saved.
Steps to reproduce the behavior:
I first created a simple implementation of a LightningModule. This contains:
class Model(pl.LightningModule):
    ...

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        result = pl.EvalResult()
        result.log('val_loss', loss, prog_bar=True, on_step=False, on_epoch=True)
        return result

    ...
When I run it with
model = Model()
trainer = pl.Trainer()  # default settings
trainer.fit(model)
I get
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:37: RuntimeWarning: The metric you returned None must be a `torch.Tensor` instance, checkpoint not saved HINT: what is the value of loss in validation_epoch_end()?
warnings.warn(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:37: RuntimeWarning: Can save best model only with loss available, skipping.
warnings.warn(*args, **kwargs)
Following the hint, I then implemented the validation_epoch_end() method. I was surprised I had to, given that most examples in the docs work without it; I assumed it was only needed when additional control over the validation procedure was required. Perhaps I would disable all callbacks by default to avoid having to implement additional methods at first. Anyway, my model now looks like:
class Model(pl.LightningModule):
    ...

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        result = pl.EvalResult()
        result.log('val_loss', loss, prog_bar=True, on_step=False, on_epoch=True)
        return result

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x for x in outputs['val_loss']]).mean()
        return {"val_loss": avg_loss}

    ...
In the above, I followed https://github.com/PyTorchLightning/pytorch-lightning/issues/1153#issuecomment-599792149 but got:
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluate_loop.py in log_epoch_metrics(self, eval_results)
148 if isinstance(eval_results, list):
149 for eval_result in eval_results:
--> 150 self.trainer.callback_metrics = eval_result.callback_metrics
151 else:
152 self.trainer.callback_metrics = eval_results.callback_metrics
AttributeError: 'dict' object has no attribute 'callback_metrics'
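My guess is that the error comes from mixing the two styles: validation_step returns an EvalResult while validation_epoch_end returns a plain dict. The linked comment seems to assume the older dict-only flow, roughly like this (a sketch, untested on 0.9.1rc1):

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.cross_entropy(y_hat, y)
    # plain dict instead of pl.EvalResult()
    return {'val_loss': loss}

def validation_epoch_end(self, outputs):
    # outputs is the list of dicts returned by validation_step
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
    return {'val_loss': avg_loss, 'log': {'val_loss': avg_loss}}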
Thus, I tried to use the new Result container:
class Model(pl.LightningModule):
    ...

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        result = pl.EvalResult()
        result.log('val_loss', loss, prog_bar=True, on_step=False, on_epoch=True)
        return result

    def validation_epoch_end(self, outputs):
        loss = torch.stack([x for x in outputs['val_loss']]).mean()
        result = pl.EvalResult()
        result.log('val_loss', loss)
        return result

    ...
When I fit the model again, I get:
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:37: RuntimeWarning: The metric you returned None must be a `torch.Tensor` instance, checkpoint not saved HINT: what is the value of loss in validation_epoch_end()?
warnings.warn(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:37: RuntimeWarning: Can save best model only with loss available, skipping.
warnings.warn(*args, **kwargs)
If I print the elements of result, I get {'val_loss': tensor(0.6092, device='cuda:0')}, so val_loss is a tensor already.
My last attempt was to define my own checkpoint callback:
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    monitor='val_loss',
    mode='min',
)
trainer = pl.Trainer(..., checkpoint_callback=checkpoint_callback)
but unfortunately I still got:
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:37: RuntimeWarning: The metric you returned None must be a `torch.Tensor` instance, checkpoint not saved HINT: what is the value of val_loss in validation_epoch_end()?
warnings.warn(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:37: RuntimeWarning: Can save best model only with val_loss available, skipping.
warnings.warn(*args, **kwargs)
Would you be able to help?
I would expect PyTorch Lightning to work with minimal boilerplate (e.g. only training_step and validation_step). Perhaps checkpoints should be disabled in Trainer() by default. Or validation_step() could automatically return the aggregated metric by default. Otherwise, we may need a better explanation of how to get the aggregated metrics out of the validation step, and make it clearer that validation_epoch_end() is needed 👍
I am using Google Colab:
* CUDA:
- GPU:
- Tesla K80
- available: True
- version: 10.1
* Packages:
- numpy: 1.18.5
- pyTorch_debug: False
- pyTorch_version: 1.6.0+cu101
- pytorch-lightning: 0.9.1rc1
- tensorboard: 2.2.0
- tqdm: 4.41.1
* System:
- OS: Linux
- architecture:
- 64bit
- processor: x86_64
- python: 3.6.9
- version: #1 SMP Thu Jul 23 08:00:38 PDT 2020
Hi! Thanks for your contribution, great first issue!
EDIT: Checkpoints work if I set them in the EvalResult like:
def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.cross_entropy(y_hat, y)
    result = pl.EvalResult(checkpoint_on=loss)
    result.log('val_loss', loss, prog_bar=True, on_step=False, on_epoch=True)
    return result
but 1) I would like to keep callbacks separate from the module and only in the Trainer(). 2) Will setting the checkpoint in EvalResult interfere with other callbacks set in Trainer()? 3) How can I set up callbacks to monitor custom (aggregate) metrics?
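On a related note, EvalResult also seems to accept an early_stop_on argument, so I assume early stopping could be wired from the module in the same way (a sketch, not tested):

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.cross_entropy(y_hat, y)
    # checkpoint_on drives the default ModelCheckpoint; early_stop_on should,
    # if I read the 0.9 API correctly, drive EarlyStopping the same way
    result = pl.EvalResult(checkpoint_on=loss, early_stop_on=loss)
    result.log('val_loss', loss, prog_bar=True, on_step=False, on_epoch=True)
    return result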
@JackCaster mind sending a PR with this failing example as minimal test so we can fix it and also prevent this case in the future... 🐰
Sure, I can do that. I am on vacation atm without a PC; I will send it sometime next week.
ok, so to your points: keeping the monitored metric only in the Trainer() means your LightningModule leaks abstraction. For this reason, we're tying the related monitor metrics to the module.
Great! I ran a quick test and it works. @Borda Do you still need the PR?
I see that the EarlyStopping callback is now disabled by default, so we need to initialize it ourselves. In contrast, ModelCheckpoint is active by default and will monitor checkpoint_on.
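In other words, the explicit setup would look roughly like this (a sketch; whether 'val_loss' is picked up as the monitored key when logging through EvalResult is something I have not double-checked):

from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

# EarlyStopping is off by default and has to be passed in explicitly,
# while ModelCheckpoint is on by default and keys off EvalResult(checkpoint_on=...)
early_stopping = EarlyStopping(monitor='val_loss', mode='min', patience=3)
checkpointing = ModelCheckpoint(monitor='val_loss', mode='min')

trainer = pl.Trainer(
    early_stop_callback=early_stopping,
    checkpoint_callback=checkpointing,
)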
Regarding EarlyStopping, there is an error in the documentation. While the Trainer disables it by default (https://github.com/PyTorchLightning/pytorch-lightning/blob/12184854f97f3d0ef8d72aaa801e661dc10d7058/pytorch_lightning/trainer/trainer.py#L85), the docs (https://pytorch-lightning.readthedocs.io/en/latest/trainer.html#early-stop-callback) read:
None: Same as, if True is specified.
Default: None.
It should instead read:
None: Equivalent to True.
Default: False.
If this is an error, shall I open a PR?
If this is an error, shall I open a PR?
yes, please, just be aware that @williamFalcon is finishing some final Result refactoring which may collide with your PR :]
Alright. Perhaps it is better if I wait a couple of days, then.
that PR should be fine. please go ahead