After running training, the output file structure looks like:
epoch=9_vl_val_loss=10.10.ckpt
lightning_logs/
└── version_0
    ├── events.out.tfevents.1585053395.dltn.22357.0
    └── meta_tags.csv
but the expected file structure looks like:
lightning_logs/
└── version_0
    ├── events.out.tfevents.1585053395.dltn.22357.0
    ├── meta_tags.csv
    └── epoch=9_vl_val_loss=10.10.ckpt
Steps to reproduce the behavior:
#!/usr/bin/env python
"""checkpoint_demo.py"""
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils import data
from pytorch_lightning import Trainer
from pytorch_lightning import LightningModule
from pytorch_lightning.callbacks import ModelCheckpoint


class ConstantDataset(data.Dataset):
    """Trivial dataset that always yields the same (input, target) pair."""

    def __len__(self):
        return 6

    def __getitem__(self, idx):
        c = torch.tensor(7.0, dtype=torch.float)
        return c, c


class CheckpointDemo(LightningModule):
    def __init__(self):
        super(CheckpointDemo, self).__init__()
        self.linear = nn.Linear(1, 1)

    @staticmethod
    def createModelCheckpoint():
        return ModelCheckpoint(
            monitor='val_loss', mode='min',
            filepath='./{epoch}_vl_{val_loss:.2f}',
            # filepath='{epoch}_vl_{val_loss:.2f}',  # a bare filename raises:
            #   File "/workspace/oplatek/code/.../venv/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py"
            #     os.makedirs(self.dirpath, exist_ok=True)
            #   File "/workspace/bin/anaconda3/lib/python3.6/os.py", line 220, in makedirs
            #     mkdir(name, mode)
            #   FileNotFoundError: [Errno 2] No such file or directory: ''
            save_weights_only=False,
            verbose=True)

    def forward(self, x):
        return self.linear(x)

    def train_dataloader(self):
        return data.DataLoader(ConstantDataset(), batch_size=1)

    def val_dataloader(self):
        return data.DataLoader(ConstantDataset(), batch_size=1)

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=1.0)

    def validation_epoch_end(self, outputs):
        val_loss = torch.stack([o['val_loss'] for o in outputs]).mean()
        return {'val_loss': val_loss, 'log': {'val_loss': val_loss}}

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {'loss': torch.nn.functional.mse_loss(self.forward(x), y)}

    def validation_step(self, batch, batch_idx):
        # Fake a val_loss that improves every epoch so a new checkpoint is saved each time.
        return {'val_loss': torch.tensor(10 + (1 / (self.current_epoch + 1)))}


if __name__ == "__main__":
    model = CheckpointDemo()
    trainer = Trainer(max_epochs=10,
                      checkpoint_callback=CheckpointDemo.createModelCheckpoint())
    trainer.fit(model)
Two questions about this bug:
1. If ModelCheckpoint saved into lightning_logs, you would be unable to save a checkpoint to any other location - would this be preferable? The current API allows you to specify any location, including the lightning_logs/version of your choice.
2. The commented-out line is an empty string because it is missing the f prefix that would make it an f-string.
Possible duplicate of #1207.
@TylerYep Regarding 2.: even without the f prefix the string is not empty - Python simply does not substitute the placeholders.
The exception is raised because the path has no directory component, so os.makedirs is called with an empty string.
In any case, PL performs the placeholder substitution itself in both cases.
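For reference, a minimal plain-Python sketch of both points (no Lightning involved; the template mirrors the one in the repro script):

import os

# Without the f prefix the braces survive verbatim; the string is not empty.
template = '{epoch}_vl_{val_loss:.2f}'
# ModelCheckpoint later fills the placeholders via str.format-style substitution
# (and prefixes each with its name, yielding e.g. epoch=9_vl_val_loss=10.10.ckpt):
print(template.format(epoch=9, val_loss=10.10))  # -> 9_vl_10.10

# A bare filename has no directory component, which reproduces the exception:
print(repr(os.path.dirname(template)))  # -> ''
os.makedirs(os.path.dirname(template), exist_ok=True)
# FileNotFoundError: [Errno 2] No such file or directory: ''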
@TylerYep About the duplicate: you are right! It is a duplicate of https://github.com/PyTorchLightning/pytorch-lightning/issues/1207
@TylerYep
I think the best way is along the lines of what you might be suggesting:

> If ModelCheckpoint saved into lightning_logs, you would be unable to save a checkpoint to any other location - would this be preferable?

No: for some use cases (e.g. debugging) I would love to keep the current flexibility.

> The current API allows you to specify any location, including the lightning_logs/version of your choice.

Right. It would be great if I could *easily obtain the path to the current lightning_logs/version_XY and set it explicitly*.
Can you show me how to do that?
If it is already possible (see the sketch below), I would suggest making it the default path.
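For later readers, here is a sketch of one way this might be done with the current API, assuming TensorBoardLogger exposes a log_dir property that resolves to <save_dir>/<name>/version_<N> (untested against the exact PL version above); creating the logger explicitly before the Trainer makes the versioned path available up front:

import os

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger

# Creating the logger ourselves (instead of letting Trainer create the default one)
# makes the versioned directory known before training starts.
logger = TensorBoardLogger(save_dir='.', name='lightning_logs')

checkpoint_callback = ModelCheckpoint(
    monitor='val_loss', mode='min',
    # logger.log_dir should be ./lightning_logs/version_<N>
    filepath=os.path.join(logger.log_dir, '{epoch}_vl_{val_loss:.2f}'),
    save_weights_only=False, verbose=True)

model = CheckpointDemo()  # the LightningModule from the repro above
trainer = Trainer(max_epochs=10, logger=logger,
                  checkpoint_callback=checkpoint_callback)
trainer.fit(model)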
Please move the discussion to #1207.