Describe the bug
Model checkpoint is not working, even with explicit checkpoint callback.
To Reproduce
Steps to reproduce the behavior:
These are the settings I'm using. hparams.checkpoint_path is actually a directory like './weights':
checkpoint_callback = ModelCheckpoint(
    filepath=hparams.checkpoint_path,
    save_best_only=True,
    verbose=True,
    monitor='val_loss',
    mode='min',
    prefix='')

model = Net3DMMSTN(hparams)
trainer = Trainer(
    early_stop_callback=None,
    track_grad_norm=2,
    checkpoint_callback=checkpoint_callback,
    print_nan_grads=True,
    weights_summary='full',
    default_save_path=hparams.checkpoint_path,
    max_nb_epochs=hparams.max_nb_epochs,
    gpus=hparams.gpus,
    fast_dev_run=hparams.dev_run)
It's not saving the model weights. Even leaving the checkpoint_callback at its default, it's not saving the weights.
1) Is the run completely finishing? The default callback saves at the end of an epoch.
2) You have monitor='val_loss' set. Are you sure you're logging it correctly? Check for an error message like "Can save best model...".
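(A minimal sketch of what that lookup needs, assuming the validation_end hook used in this version of Lightning; names are illustrative. The string passed to monitor= has to match a key in the dict returned from validation_end, otherwise the callback has nothing to compare.)

def validation_end(self, outputs):
    # 'val_loss' here is the key that monitor='val_loss' will look up
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
    return {'val_loss': avg_loss, 'log': {'val_loss': avg_loss}}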
@jeffling yes, the run completely finishes.
My two functions are set up like this:
def validation_step(self, batch, batch_nb):
    input, label = batch
    sel, mask, alpha, predgrid, _ = self.forward(input)
    loss, l1, l2, l3, l4 = self.criterion(sel, label, alpha, predgrid)
    self.last_val_images = input
    log_dict = {
        'valid/loss': loss,
        'valid/euclidean_loss': l1,
        'valid/sse_loss': l2,
        'valid/siamese_loss': l3,
        'valid/symmetry_loss': l4
    }
    return {'val_loss': loss, 'log': log_dict}
def validation_end(self, outputs):
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
    log_dict = {
        'valid/loss': avg_loss,
        'valid/euclidean_loss': torch.stack([x['log']['valid/euclidean_loss'] for x in outputs]).mean(),
        'valid/sse_loss': torch.stack([x['log']['valid/sse_loss'] for x in outputs]).mean(),
        'valid/siamese_loss': torch.stack([x['log']['valid/siamese_loss'] for x in outputs]).mean(),
        'valid/symmetry_loss': torch.stack([x['log']['valid/symmetry_loss'] for x in outputs]).mean()
    }
    return {'avg_val_loss': avg_loss, 'log': log_dict}
Is this the correct way to do it?
I'm on my phone right now so I can't verify, but try changing monitor='val_loss' to monitor='avg_val_loss'.
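(Concretely, that would look like this, a sketch with only the monitor key swapped and everything else kept as in the original snippet:)

checkpoint_callback = ModelCheckpoint(
    filepath=hparams.checkpoint_path,
    save_best_only=True,
    verbose=True,
    monitor='avg_val_loss',  # matches the 'avg_val_loss' key returned from validation_end
    mode='min',
    prefix='')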
@jeffling @williamFalcon making the change worked, but it now throws the following error:
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/stn_env/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/home/ubuntu/miniconda3/envs/stn_env/lib/python3.7/site-packages/tensorboard/summary/writer/event_file_writer.py", line 211, in run
self._record_writer.write(data)
File "/home/ubuntu/miniconda3/envs/stn_env/lib/python3.7/site-packages/tensorboard/summary/writer/record_writer.py", line 39, in write
self._writer.write(header + header_crc + data + footer_crc)
File "/home/ubuntu/miniconda3/envs/stn_env/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 474, in write
self.fs.append(self.filename, file_content, self.binary_mode)
File "/home/ubuntu/miniconda3/envs/stn_env/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 139, in append
self._write(filename, file_content, "ab" if binary_mode else "a")
File "/home/ubuntu/miniconda3/envs/stn_env/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 143, in _write
with io.open(filename, mode, encoding=encoding) as f:
FileNotFoundError: [Errno 2] No such file or directory: b'/home/ubuntu/3DMMasSTN-Pytorch/weights/lightning_logs/version_0/tf/events.out.tfevents.1574084352.gpunew9.4187.0'
My hparams.checkpoint_path is actually a directory like './weights'.
Is there some way to save it in the version_0 directory? Also, according to the docs, the model should checkpoint automatically without an explicit
trainer = Trainer(checkpoint_callback=checkpoint_callback)
option in the trainer.
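(One possible workaround, purely an assumption on my part and not verified in this thread: point the ModelCheckpoint filepath at a subdirectory rather than the same directory that default_save_path uses for lightning_logs, so the callback and the TensorBoard writer don't operate on the same tree. The 'checkpoints' subdirectory name below is hypothetical.)

import os

checkpoint_callback = ModelCheckpoint(
    filepath=os.path.join(hparams.checkpoint_path, 'checkpoints'),  # hypothetical subdir, kept apart from lightning_logs
    save_best_only=True,
    monitor='avg_val_loss',
    mode='min')

trainer = Trainer(
    checkpoint_callback=checkpoint_callback,
    default_save_path=hparams.checkpoint_path)  # lightning_logs/version_0 stays under this directory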
Turns out that changing
def validation_end(self, outputs):
...
return {'avg_val_loss': avg_loss, 'log': log_dict}
to
def validation_end(self, outputs):
...
return {'val_loss': avg_loss, 'log': log_dict}
starts checkpointing without explicitly passing the checkpoint_callback option. This should be added to the docs.
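(For anyone hitting the same thing, a minimal sketch of the working setup, assuming the same pre-1.0 Lightning API used above: the default checkpoint callback monitors the 'val_loss' key returned from validation_end, so no explicit checkpoint_callback is needed.)

def validation_end(self, outputs):
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
    # returning the key 'val_loss' is what lets the default ModelCheckpoint track it
    return {'val_loss': avg_loss, 'log': {'valid/loss': avg_loss}}

trainer = Trainer(
    default_save_path=hparams.checkpoint_path,  # checkpoints and logs go under this directory
    max_nb_epochs=hparams.max_nb_epochs,
    gpus=hparams.gpus)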
So for checkpointing to work, we don't need to pass a checkpoint_callback to the trainer?
@Jiequannnnnnnnnn yes