Pytorch-lightning: Save checkpoint under the lightning_logs/version_X/ directory

Created on 24 Mar 2020 · 5 Comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

After training finishes, the output file structure looks like this:

epoch=9_vl_val_loss=10.10.ckpt
lightning_logs/
├── version_0
│   ├── events.out.tfevents.1585053395.dltn.22357.0
│   └── meta_tags.csv

but the expected file structure is:

lightning_logs/
├── version_0
│   ├── events.out.tfevents.1585053395.dltn.22357.0
│   ├── meta_tags.csv
│   └── epoch=9_vl_val_loss=10.10.ckpt

To Reproduce

Steps to reproduce the behavior:

1. Use PyTorch 1.4 and PL 0.7.1.
2. Run the following snippet, "checkpoint_demo.py".

Code sample

#!/usr/bin/env python
"""checkpoint_demo.py"""
from torch.utils import data
import torch
import torch.nn as nn
import torch.optim as optim
from pytorch_lightning import Trainer
from pytorch_lightning import LightningModule
from pytorch_lightning.callbacks import ModelCheckpoint


class ConstantDataset(data.Dataset):
    def __len__(self): return 6
    def __getitem__(self, idx):
        c = torch.tensor(7.0, dtype=torch.float)
        return c, c

class CheckpointDemo(LightningModule):
    def __init__(self):
        super(CheckpointDemo, self).__init__()
        self.linear = nn.Linear(1, 1)

    @staticmethod
    def createModelCheckpoint():
        return ModelCheckpoint(monitor='val_loss', mode='min',
                               filepath='./{epoch}_vl_{val_loss:.2f}',
                               # filepath='{epoch}_vl_{val_loss:.2f}',  # if just filename it raises exception
                               # "/workspace/oplatek/code/.../venv/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py",
                               #     os.makedirs(self.dirpath, exist_ok=True)
                               #   File "/workspace/bin/anaconda3/lib/python3.6/os.py", line 220, in makedirs
                               #     mkdir(name, mode)
                               # FileNotFoundError: [Errno 2] No such file or directory: ''
                               save_weights_only=False,
                               verbose=True)

    def forward(self, x):
        return self.linear(x)

    def train_dataloader(self):
        return data.DataLoader(ConstantDataset(), batch_size=1)

    def val_dataloader(self):
        return data.DataLoader(ConstantDataset(), batch_size=1)

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=1.0)

    def validation_epoch_end(self, outputs):
        val_loss = torch.stack([o['val_loss'] for o in outputs]).mean()
        return {'val_loss': val_loss, 'log': {'val_loss': val_loss}}

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {'loss': torch.nn.functional.mse_loss(self.forward(x), y)}

    def validation_step(self, batch, batch_idx):
        return {'val_loss': torch.tensor(10 + (1 / (self.current_epoch + 1)))}


if __name__ == "__main__":
    model = CheckpointDemo()
    trainer = Trainer(max_epochs=10, checkpoint_callback=CheckpointDemo.createModelCheckpoint())
    trainer.fit(model)

duplicate help wanted question

Source

oplatek

All 5 comments

Two questions about this bug:

1. If ModelCheckpoint always saves to lightning_logs, you will be unable to save a file to any other location. Would this be preferable? The current API allows you to specify any location, including the lightning_logs/version of your choice.
2. The commented line is an empty string because it is missing the f in the f-string f"". This is why the file cannot be saved. Once I add the f, epoch is not a defined variable. Does this fix that particular error?

Possible Duplicate of #1207

TylerYep on 26 Mar 2020

@TylerYep Regarding 2: it is not an empty string even if the f is missing. The placeholders just won't be substituted, so the string won't be empty.
The error occurs because the path has no directory component, so its dirname is '' and os.makedirs is called on that empty string.

In any case, PL does the substitution for both cases

oplatek on 26 Mar 2020
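
The explanation above can be checked without Lightning at all. The following is a minimal sketch using only the standard library; the helper name `try_makedirs` is made up for illustration. It shows that a brace template without the f prefix is not empty, that its directory part is empty, and that `os.makedirs('')` is what fails.

```python
import os

# A plain string (no f prefix): the braces are kept literally.
template = "{epoch}_vl_{val_loss:.2f}"

# With no directory component, dirname is the empty string.
dirpath = os.path.dirname(template)
# Prefixing "./" gives a usable directory part.
dirpath_with_dot = os.path.dirname("./" + template)


def try_makedirs(path):
    """Hypothetical helper: report whether os.makedirs accepts the path."""
    try:
        os.makedirs(path, exist_ok=True)
        return "ok"
    except FileNotFoundError as exc:
        return f"FileNotFoundError: {exc}"


print(repr(template))
print(repr(dirpath), try_makedirs(dirpath))
print(repr(dirpath_with_dot), try_makedirs(dirpath_with_dot))
```

This is consistent with the traceback quoted in the code sample, where `os.makedirs(self.dirpath, exist_ok=True)` fails for a bare filename and succeeds once the filepath starts with `./`.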

@TylerYep About the duplicate: you are right! It is a duplicate of https://github.com/PyTorchLightning/pytorch-lightning/issues/1207

oplatek on 26 Mar 2020

@TylerYep
I think the best way is something you might be suggesting:

> If ModelCheckpoint saves to the lightning_logs, you will be unable to specify a way to save a file to any other location - would this be preferable?

No. For some particular reasons (e.g. debugging), I would love to keep the current flexibility.

> The current API allows you to specify any location to add it to, including the lightning_logs/version of your choice.

Right. It would be great if I could *easily obtain the path to the current lightning_logs/version_XY and set it explicitly*.
Can you show me how to do it?
If that is possible right now, I would suggest making it the default path.

oplatek on 26 Mar 2020
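
One way to build such a path by hand is sketched below, using only the standard library. It assumes the `lightning_logs/version_N` layout shown in the issue; the helpers `next_version_dir` and `checkpoint_template` are hypothetical and not part of Lightning. Whether the logger picks the same version number on its own is also an assumption, so it may be safer to pass the chosen version to the logger explicitly.

```python
import os
import re


def next_version_dir(root="lightning_logs"):
    """Hypothetical helper: return root/version_N, where N is one past
    the highest existing version_* directory (0 if none exist)."""
    existing = []
    if os.path.isdir(root):
        for entry in os.listdir(root):
            match = re.fullmatch(r"version_(\d+)", entry)
            if match and os.path.isdir(os.path.join(root, entry)):
                existing.append(int(match.group(1)))
    number = max(existing) + 1 if existing else 0
    return os.path.join(root, f"version_{number}")


def checkpoint_template(version_dir, template="{epoch}_vl_{val_loss:.2f}"):
    """Hypothetical helper: join the version directory and the filename
    template so the result has a non-empty directory part."""
    return os.path.join(version_dir, template)


if __name__ == "__main__":
    filepath = checkpoint_template(next_version_dir())
    # The result could then be passed as ModelCheckpoint(filepath=...),
    # as in the code sample from the issue.
    print(filepath)
```

Because the resulting filepath always has a directory component, it also avoids the empty-dirname error discussed earlier in the thread.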

Please move the discussion to #1207.