Pytorch-lightning: Model checkpoint is not working

Created on 15 Nov 2019 · 7 comments · Source: PyTorchLightning/pytorch-lightning

Describe the bug
Model checkpoint is not working, even with explicit checkpoint callback.

To Reproduce
Steps to reproduce the behavior:
These are the settings I'm using; hparams.checkpoint_path is a directory such as './weights'.

checkpoint_callback = ModelCheckpoint(
    filepath=hparams.checkpoint_path,
    save_best_only=True,
    verbose=True,
    monitor='val_loss',
    mode='min',
    prefix='')

model = Net3DMMSTN(hparams)
trainer = Trainer(
    early_stop_callback=None,
    track_grad_norm=2,
    checkpoint_callback=checkpoint_callback,
    print_nan_grads=True,
    weights_summary='full',
    default_save_path=hparams.checkpoint_path,
    max_nb_epochs=hparams.max_nb_epochs,
    gpus=hparams.gpus,
    fast_dev_run=hparams.dev_run)

It's not saving the model weights.
Even with checkpoint_callback left at its default, the weights are not saved.

Expected behavior
The checkpoint callback should save the model weights whenever the monitored val_loss improves.

Desktop (please complete the following information):

  • OS: Linux
    kernel version: 21~18.04.1-Ubuntu SMP Mon Oct 7 04:51:28 UTC 2019
    release version: 5.0.0-1021-gcp
    platform: Linux-5.0.0-1021-gcp-x86_64-with-Ubuntu-18.04-bionic
  • pytorch-lightning: 0.5.3
  • PyTorch version: 1.3.0
  • CUDA libs: /usr/local/cuda-10.0 (libcudart.so.10.0.130)
Labels: bug / fix

Most helpful comment

I'm on my phone right now so I can't verify, but try changing monitor='val_loss' to monitor='avg_val_loss'.

All 7 comments

1) Is the run completely finishing? The default callback saves at the end of an epoch.
2) You have val_loss set. Are you sure you're logging it correctly? Check for an error message like "Can save best model...".

@jeffling yes, the run completely finishes.

My two functions are set up like this:

def validation_step(self, batch, batch_nb):
    input, label = batch
    sel, mask, alpha, predgrid, _ = self.forward(input)
    loss, l1, l2, l3, l4 = self.criterion(sel, label, alpha, predgrid)
    self.last_val_images = input

    log_dict = {
        'valid/loss': loss,
        'valid/euclidean_loss': l1,
        'valid/sse_loss': l2,
        'valid/siamese_loss': l3,
        'valid/symmetry_loss': l4
    }

    return {'val_loss': loss, 'log': log_dict}

def validation_end(self, outputs):
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
    log_dict = {
        'valid/loss': avg_loss,
        'valid/euclidean_loss': torch.stack(
            [x['log']['valid/euclidean_loss'] for x in outputs]).mean(),
        'valid/sse_loss': torch.stack(
            [x['log']['valid/sse_loss'] for x in outputs]).mean(),
        'valid/siamese_loss': torch.stack(
            [x['log']['valid/siamese_loss'] for x in outputs]).mean(),
        'valid/symmetry_loss': torch.stack(
            [x['log']['valid/symmetry_loss'] for x in outputs]).mean()
    }

    return {'avg_val_loss': avg_loss, 'log': log_dict}

Is this the correct way to do it?

I'm on my phone right now so I can't verify, but try changing monitor='val_loss' to monitor='avg_val_loss'.
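For reference, a sketch of the callback with only the monitor key changed; all other arguments are the ones from the original report:

checkpoint_callback = ModelCheckpoint(
    filepath=hparams.checkpoint_path,  # a directory such as './weights'
    save_best_only=True,
    verbose=True,
    monitor='avg_val_loss',  # must match a key returned by validation_end
    mode='min',
    prefix='')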

@jeffling @williamFalcon making the change worked, but it throws the following error:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/stn_env/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/miniconda3/envs/stn_env/lib/python3.7/site-packages/tensorboard/summary/writer/event_file_writer.py", line 211, in run
    self._record_writer.write(data)
  File "/home/ubuntu/miniconda3/envs/stn_env/lib/python3.7/site-packages/tensorboard/summary/writer/record_writer.py", line 39, in write
    self._writer.write(header + header_crc + data + footer_crc)
  File "/home/ubuntu/miniconda3/envs/stn_env/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 474, in write
    self.fs.append(self.filename, file_content, self.binary_mode)
  File "/home/ubuntu/miniconda3/envs/stn_env/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 139, in append
    self._write(filename, file_content, "ab" if binary_mode else "a")
  File "/home/ubuntu/miniconda3/envs/stn_env/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 143, in _write
    with io.open(filename, mode, encoding=encoding) as f:
FileNotFoundError: [Errno 2] No such file or directory: b'/home/ubuntu/3DMMasSTN-Pytorch/weights/lightning_logs/version_0/tf/events.out.tfevents.1574084352.gpunew9.4187.0'

My hparams.checkpoint_path is a directory like './weights'.
Is there some way to save it in the version_0 directory? Also, according to the docs the model should checkpoint automatically without an explicit
trainer = Trainer(checkpoint_callback=checkpoint_callback)
option in the trainer.
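One hedged guess, not confirmed in this thread: the callback's filepath and the Trainer's default_save_path both point at './weights', so the checkpoint callback and the TensorBoard logger write into the same tree. A sketch that gives the checkpoints their own sub-directory (ckpt_dir is a hypothetical name; the other arguments are the ones already shown above):

import os

# Sketch only: keep checkpoints out of the directory that holds
# './weights/lightning_logs/' and its event files.
ckpt_dir = os.path.join(hparams.checkpoint_path, 'checkpoints')
os.makedirs(ckpt_dir, exist_ok=True)
checkpoint_callback = ModelCheckpoint(
    filepath=ckpt_dir,
    save_best_only=True,
    monitor='avg_val_loss',
    mode='min')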

Turns out that changing

def validation_end(self, outputs):
        ...
        return {'avg_val_loss': avg_loss, 'log': log_dict}

to

def validation_end(self, outputs):
        ...
        return {'val_loss': avg_loss, 'log': log_dict}

starts checkpointing without explicitly passing the checkpoint_callback option. This should be added to the docs.
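Putting it together, a minimal sketch of the working setup; the model, hparams and Trainer arguments are the ones from the original report, and trainer.fit(model) is the standard Lightning entry point:

# Because validation_end now returns 'val_loss', the default checkpoint
# callback is used automatically; no explicit ModelCheckpoint is passed.
model = Net3DMMSTN(hparams)
trainer = Trainer(
    default_save_path=hparams.checkpoint_path,
    max_nb_epochs=hparams.max_nb_epochs,
    gpus=hparams.gpus)
trainer.fit(model)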

So for the checkpoint_callback to work, we don't need to pass it to the trainer?

@Jiequannnnnnnnnn yes
