Pytorch-lightning: Checkpoints not working

Created on 31 Aug 2020  ·  9 Comments  ·  Source: PyTorchLightning/pytorch-lightning

🐛 Bug

The default checkpoint_callback in Trainer() does not work, so the model's checkpoints are not saved.

To Reproduce

Steps to reproduce the behavior:

I first created a simple implementation of a LightningModule, which contains:

class Model(pl.LightningModule):
   ...
   def validation_step(self, batch, batch_idx):
      x, y = batch
      y_hat = self(x)
      loss = F.cross_entropy(y_hat, y)
      result = pl.EvalResult()
      result.log('val_loss', loss, prog_bar=True, on_step=False, on_epoch=True)
      return result
   ...

When I run it with

model = Model()
trainer = pl.Trainer()  # using default settings
trainer.fit(model)

I get

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:37: RuntimeWarning: The metric you returned None must be a `torch.Tensor` instance, checkpoint not saved HINT: what is the value of loss in validation_epoch_end()?
  warnings.warn(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:37: RuntimeWarning: Can save best model only with loss available, skipping.
  warnings.warn(*args, **kwargs)

Following the hint, I then implemented the validation_epoch_end() method. I was surprised I had to, given that most examples in the docs work without it; I had assumed it was only needed when additional control over the validation procedure was required. Perhaps all callbacks could be disabled by default, to avoid having to implement additional methods at first. Anyway, my model now looks like:

class Model(pl.LightningModule):
   ...
   def validation_step(self, batch, batch_idx):
      x, y = batch
      y_hat = self(x)
      loss = F.cross_entropy(y_hat, y)
      result = pl.EvalResult()
      result.log('val_loss', loss, prog_bar=True, on_step=False, on_epoch=True)
      return result

   def validation_epoch_end(self, outputs):
      avg_loss = torch.stack([x for x in outputs['val_loss']]).mean()
      return {"val_loss": avg_loss}
   ...

In the above, I followed https://github.com/PyTorchLightning/pytorch-lightning/issues/1153#issuecomment-599792149 but got:

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluate_loop.py in log_epoch_metrics(self, eval_results)
    148             if isinstance(eval_results, list):
    149                 for eval_result in eval_results:
--> 150                     self.trainer.callback_metrics = eval_result.callback_metrics
    151             else:
    152                 self.trainer.callback_metrics = eval_results.callback_metrics

AttributeError: 'dict' object has no attribute 'callback_metrics'

Thus, I tried to use the new Result container:

class Model(pl.LightningModule):
   ...
   def validation_step(self, batch, batch_idx):
      x, y = batch
      y_hat = self(x)
      loss = F.cross_entropy(y_hat, y)
      result = pl.EvalResult()
      result.log('val_loss', loss, prog_bar=True, on_step=False, on_epoch=True)
      return result

   def validation_epoch_end(self, outputs):
      loss = torch.stack([x for x in outputs['val_loss']]).mean()
      result = pl.EvalResult()
      result.log('val_loss', loss)
      return result
   ...

When I fit the model again, I get:

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:37: RuntimeWarning: The metric you returned None must be a `torch.Tensor` instance, checkpoint not saved HINT: what is the value of loss in validation_epoch_end()?
  warnings.warn(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:37: RuntimeWarning: Can save best model only with loss available, skipping.
  warnings.warn(*args, **kwargs)

If I print the elements of result, I get {'val_loss': tensor(0.6092, device='cuda:0')}, so val_loss is a tensor already.

My last attempt was to define my own callback:

from pytorch_lightning.callbacks import ModelCheckpoint
checkpoint_callback = ModelCheckpoint(
    monitor='val_loss',
    mode='min',
)

trainer = pl.Trainer(..., checkpoint_callback=checkpoint_callback)

but unfortunately I still got:

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:37: RuntimeWarning: The metric you returned None must be a `torch.Tensor` instance, checkpoint not saved HINT: what is the value of val_loss in validation_epoch_end()?
  warnings.warn(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:37: RuntimeWarning: Can save best model only with val_loss available, skipping.
  warnings.warn(*args, **kwargs)

Would you be able to help?

Expected behavior

I would expect Pytorch-lightning to work with minimal boilerplate (e.g. only training_step and validation_step). Perhaps checkpoints should be disabled in Trainer() by default. Or validation_step() could automatically return the aggregated metric by default. Otherwise, we may need a better explanation of how to get the aggregated metrics out of the validation step, and to make it clearer that validation_epoch_end() is needed 👍

Environment

I am using Google Colab:

* CUDA:
    - GPU:
        - Tesla K80
    - available:         True
    - version:           10.1
* Packages:
    - numpy:             1.18.5
    - pyTorch_debug:     False
    - pyTorch_version:   1.6.0+cu101
    - pytorch-lightning: 0.9.1rc1
    - tensorboard:       2.2.0
    - tqdm:              4.41.1
* System:
    - OS:                Linux
    - architecture:
        - 64bit
        - 
    - processor:         x86_64
    - python:            3.6.9
    - version:           #1 SMP Thu Jul 23 08:00:38 PDT 2020
Labels: checkpoint, bug / fix, help wanted

All 9 comments

Hi! Thanks for your contribution, great first issue!

EDIT: Checkpoints work if I set them in the EvalResult like:

    def validation_step(self, batch, batch_idx):
      x, y = batch
      y_hat = self(x)
      loss = F.cross_entropy(y_hat, y)
      result = pl.EvalResult(checkpoint_on=loss)
      result.log('val_loss', loss, prog_bar=True, on_step=False, on_epoch=True)
      return result

but 1) I would like to keep callbacks separate from the module and only in the Trainer(). 2) Will setting the checkpoint in EvalResult interfere with other callbacks set in Trainer()? 3) How can I set up callbacks to monitor custom (aggregate) metrics?
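
On question 3, one pattern that should work is to compute the custom metric per batch and pass it as checkpoint_on, replacing the validation_step above. This is only a sketch assuming the 0.9.x EvalResult API, where checkpoint_on is averaged over the validation batches; the val_error name and the error-rate metric are just illustrative:

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        # a custom metric; an error rate (lower is better) suits checkpointing on a minimum
        error_rate = (y_hat.argmax(dim=1) != y).float().mean()
        # checkpoint_on is reduced across the epoch, so the checkpoint tracks
        # the aggregate value rather than a single batch
        result = pl.EvalResult(checkpoint_on=error_rate)
        result.log('val_loss', loss, prog_bar=True, on_step=False, on_epoch=True)
        result.log('val_error', error_rate, on_step=False, on_epoch=True)
        return result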

@JackCaster mind sending a PR with this failing example as a minimal test so we can fix it and also prevent this case in the future... 🐰

@JackCaster mind sending a PR with this failing example as a minimal test so we can fix it and also prevent this case in the future... 🐰

Sure, I can do that. I am on vacation atm without a PC. I will send it sometime next week.

OK, so to your points.

  1. Can we keep these metrics separate, out of the LightningModule? Well, the LightningModule needs to be self-contained... it is what defines what to early stop on or not. If you don't configure it there, you still have to look in the module to figure out what to monitor...

For this reason, we're tying the related monitor metrics to the module.

  2. In fact, imagine your module requires a special callback. You can no longer share your model around and drop it into any Lightning Trainer. Now you ALSO have to tell the person not to forget to init that special callback and to do special things for it to work with the module.

It means your LightningModule leaks its abstraction.
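
To illustrate the point, here is a minimal self-contained sketch (the SelfContainedModel and random_loader names are just for illustration, and it assumes the 0.9.x TrainResult/EvalResult API) of a module that declares its own checkpoint metric, so a plain Trainer() can checkpoint it without any extra callback wiring:

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class SelfContainedModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        return pl.TrainResult(loss)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        # the module itself declares what to checkpoint and early stop on
        result = pl.EvalResult(checkpoint_on=loss, early_stop_on=loss)
        result.log('val_loss', loss, prog_bar=True)
        return result

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

def random_loader():
    # tiny random dataset, only to make the sketch runnable end to end
    x = torch.randn(64, 32)
    y = torch.randint(0, 2, (64,))
    return DataLoader(TensorDataset(x, y), batch_size=16)

model = SelfContainedModel()
trainer = pl.Trainer(max_epochs=2)  # no special callbacks passed in
trainer.fit(model, train_dataloader=random_loader(), val_dataloaders=random_loader())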

Great! I ran a quick test and it works. @Borda Do you still need the PR?

I see that the EarlyStopping callback is now disabled by default, so we need to initialize it ourselves. In contrast, ModelCheckpoint is active by default and will monitor checkpoint_on.
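
For example, something along these lines should re-enable it (a sketch; it assumes the 0.9.x Trainer still accepts the early_stop_callback argument, and with EvalResult the metric actually tracked may be early_stop_on rather than monitor, depending on the version):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# EarlyStopping is created explicitly and handed to the Trainer;
# ModelCheckpoint stays enabled by default and tracks checkpoint_on.
early_stop = EarlyStopping(monitor='val_loss', mode='min', patience=3)
trainer = Trainer(early_stop_callback=early_stop)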

Regarding EarlyStopping, there is an error in the documentation. While Trainer disables it by default (https://github.com/PyTorchLightning/pytorch-lightning/blob/12184854f97f3d0ef8d72aaa801e661dc10d7058/pytorch_lightning/trainer/trainer.py#L85), the doc (https://pytorch-lightning.readthedocs.io/en/latest/trainer.html#early-stop-callback) reads:

None: Same as, if True is specified.
Default: None.

it should be instead:

None: Equivalent to True.
Default: False.

If this is an error, shall I open a PR?

If this is an error, shall I open a PR?

Yes, please, just be aware that @williamFalcon is finishing some final Result refactoring, which may collide with your PR :]

If this is an error, shall I open a PR?

Yes, please, just be aware that @williamFalcon is finishing some final Result refactoring, which may collide with your PR :]

Alright. Perhaps it is better if I wait a couple of days, then.

That PR should be fine. Please go ahead.
