Pytorch-lightning: Model checkpoint is not working

Created on 15 Nov 2019 · 7 comments · Source: PyTorchLightning/pytorch-lightning

Describe the bug
Model checkpoint is not working, even with explicit checkpoint callback.

To Reproduce
Steps to reproduce the behavior:
These are the settings I'm using; hparams.checkpoint_path is a directory such as './weights'.

checkpoint_callback = ModelCheckpoint(
    filepath=hparams.checkpoint_path,
    save_best_only=True,
    verbose=True,
    monitor='val_loss',
    mode='min',
    prefix='')

model = Net3DMMSTN(hparams)
trainer = Trainer(
    early_stop_callback=None,
    track_grad_norm=2,
    checkpoint_callback=checkpoint_callback,
    print_nan_grads=True,
    weights_summary='full',
    default_save_path=hparams.checkpoint_path,
    max_nb_epochs=hparams.max_nb_epochs,
    gpus=hparams.gpus,
    fast_dev_run=hparams.dev_run)

It's not saving the model weights.
Even with checkpoint_callback left at its default, the weights are not saved.

Expected behavior
The checkpoint callback should save the model weights whenever the monitored val_loss improves.

Desktop (please complete the following information):

  • OS: Linux
    kernel version: 21~18.04.1-Ubuntu SMP Mon Oct 7 04:51:28 UTC 2019
    release version: 5.0.0-1021-gcp
    platform: Linux-5.0.0-1021-gcp-x86_64-with-Ubuntu-18.04-bionic
  • pytorch-lightning: 0.5.3
  • PyTorch version: 1.3.0
  • CUDA libs: /usr/local/cuda-10.0 (libcudart.so.10.0.130)
Labels: bug / fix

Most helpful comment

I'm on my phone right now so I can't verify, but try changing monitor='val_loss' to monitor='avg_val_loss'.

All 7 comments

1) Is the run completely finishing? The default callback saves at the end of an epoch.
2) You have val_loss set. Are you sure you're logging it correctly? Check for an error message like "Can save best model...".

@jeffling yes, the run completely finishes.

My two functions are set up like this:

def validation_step(self, batch, batch_nb):
    input, label = batch
    sel, mask, alpha, predgrid, _ = self.forward(input)
    loss, l1, l2, l3, l4 = self.criterion(sel, label, alpha, predgrid)
    self.last_val_images = input

    log_dict = {
        'valid/loss': loss,
        'valid/euclidean_loss': l1,
        'valid/sse_loss': l2,
        'valid/siamese_loss': l3,
        'valid/symmetry_loss': l4
    }

    return {'val_loss': loss, 'log': log_dict}

def validation_end(self, outputs):
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
    log_dict = {
        'valid/loss': avg_loss,
        'valid/euclidean_loss': torch.stack(
            [x['log']['valid/euclidean_loss'] for x in outputs]).mean(),
        'valid/sse_loss': torch.stack(
            [x['log']['valid/sse_loss'] for x in outputs]).mean(),
        'valid/siamese_loss': torch.stack(
            [x['log']['valid/siamese_loss'] for x in outputs]).mean(),
        'valid/symmetry_loss': torch.stack(
            [x['log']['valid/symmetry_loss'] for x in outputs]).mean()
    }

    return {'avg_val_loss': avg_loss, 'log': log_dict}

Is this the correct way to do it?

I'm on my phone right now so I can't verify, but try changing monitor='val_loss' to monitor='avg_val_loss'.
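For reference, a sketch of the callback with only the monitor key changed; all other arguments are the ones from the original report:

checkpoint_callback = ModelCheckpoint(
    filepath=hparams.checkpoint_path,  # a directory such as './weights'
    save_best_only=True,
    verbose=True,
    monitor='avg_val_loss',  # must match a key returned by validation_end
    mode='min',
    prefix='')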

@jeffling @williamFalcon making the change worked, but it throws the following error:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/stn_env/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/miniconda3/envs/stn_env/lib/python3.7/site-packages/tensorboard/summary/writer/event_file_writer.py", line 211, in run
    self._record_writer.write(data)
  File "/home/ubuntu/miniconda3/envs/stn_env/lib/python3.7/site-packages/tensorboard/summary/writer/record_writer.py", line 39, in write
    self._writer.write(header + header_crc + data + footer_crc)
  File "/home/ubuntu/miniconda3/envs/stn_env/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 474, in write
    self.fs.append(self.filename, file_content, self.binary_mode)
  File "/home/ubuntu/miniconda3/envs/stn_env/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 139, in append
    self._write(filename, file_content, "ab" if binary_mode else "a")
  File "/home/ubuntu/miniconda3/envs/stn_env/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 143, in _write
    with io.open(filename, mode, encoding=encoding) as f:
FileNotFoundError: [Errno 2] No such file or directory: b'/home/ubuntu/3DMMasSTN-Pytorch/weights/lightning_logs/version_0/tf/events.out.tfevents.1574084352.gpunew9.4187.0'

My hparams.checkpoint_path is a directory like './weights'.
Is there some way to save it in the version_0 directory? Also, according to the docs the model should checkpoint automatically without an explicit
trainer = Trainer(checkpoint_callback=checkpoint_callback)
option in the trainer.
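One hedged guess, not confirmed in this thread: the callback's filepath and the Trainer's default_save_path both point at './weights', so the checkpoint callback and the TensorBoard logger write into the same tree. A sketch that gives the checkpoints their own sub-directory (ckpt_dir is a hypothetical name; the other arguments are the ones already shown above):

import os

# Sketch only: keep checkpoints out of the directory that holds
# './weights/lightning_logs/' and its event files.
ckpt_dir = os.path.join(hparams.checkpoint_path, 'checkpoints')
os.makedirs(ckpt_dir, exist_ok=True)
checkpoint_callback = ModelCheckpoint(
    filepath=ckpt_dir,
    save_best_only=True,
    monitor='avg_val_loss',
    mode='min')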

Turns out that changing

def validation_end(self, outputs):
        ...
        return {'avg_val_loss': avg_loss, 'log': log_dict}

to

def validation_end(self, outputs):
        ...
        return {'val_loss': avg_loss, 'log': log_dict}

starts checkpointing without explicitly passing the checkpoint_callback option. This should be added to the docs.
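Putting it together, a minimal sketch of the working setup; the model, hparams and Trainer arguments are the ones from the original report, and trainer.fit(model) is the standard Lightning entry point:

# Because validation_end now returns 'val_loss', the default checkpoint
# callback is used automatically; no explicit ModelCheckpoint is passed.
model = Net3DMMSTN(hparams)
trainer = Trainer(
    default_save_path=hparams.checkpoint_path,
    max_nb_epochs=hparams.max_nb_epochs,
    gpus=hparams.gpus)
trainer.fit(model)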

So for the checkpoint_callback to work, we don't need to pass it to the trainer?

@Jiequannnnnnnnnn yes
