I've trained a system as follows:
```python
model = CoolSystem(hparams)

checkpoint_callback = pl.callbacks.ModelCheckpoint(
    filepath=os.path.join(os.getcwd(), 'checkpoints'),
    verbose=True,
    monitor='val_acc',
    mode='max',
    prefix='try',
    save_top_k=-1,
    period=1
)

trainer = pl.Trainer(
    max_epochs=hparams.epochs,
    checkpoint_callback=checkpoint_callback)

trainer.fit(model)
trainer.test()
```
With the above, the test accuracy is 0.7975.
However, when I load the checkpoints separately instead:
```python
model_test = CoolSystem(hyperparams)
model_test.load_from_checkpoint('checkpoints/try_ckpt_epoch_1.ckpt')

trainer = pl.Trainer()
trainer.test(model_test)
```
The accuracy returned is 0.5705. Am I loading the checkpoint incorrectly?
Is this the `CoolSystem` from the examples, or is it something custom? Basically, you need to tell us how to reproduce it.
Maybe there is something wrong in how you accumulate the test accuracy? Do you get the same values if you run `trainer.test(model_test)` multiple times (in both cases)?
@awaelchli, here is my `CoolSystem` for reference:
```python
import os

import numpy as np
import torch
import torch.nn.functional as F
import torchvision
from torchvision import transforms
import pytorch_lightning as pl


class CoolSystem(pl.LightningModule):

    def __init__(self, hparams):
        super(CoolSystem, self).__init__()

        self.hparams = hparams
        self.data_dir = self.hparams.data_dir
        self.num_classes = self.hparams.num_classes  # len(self.hparams.classes)

        ########## define the model ##########
        arch = torchvision.models.resnet18(pretrained=True)
        num_ftrs = arch.fc.in_features

        modules = list(arch.children())[:-1]  # drop the final fc layer
        self.backbone = torch.nn.Sequential(*modules)  # [bs, 512, 1, 1]
        self.final = torch.nn.Sequential(
            torch.nn.Linear(num_ftrs, 128),
            torch.nn.ReLU(inplace=True),
            torch.nn.Linear(128, self.num_classes),
            torch.nn.Softmax(dim=1))
        ########## /define the model ##########

        self.training_acc_across_batches_at_curr_epoch = []

    def forward(self, x):
        x = self.backbone(x)
        x = x.reshape(x.size(0), -1)
        x = self.final(x)
        return x

    def configure_optimizers(self):
        # REQUIRED
        # can return multiple optimizers and learning_rate schedulers
        # (LBFGS is automatically supported, no need for a closure function)
        optimizer = torch.optim.SGD(
            list(self.backbone.parameters()) + list(self.final.parameters()),
            lr=0.001, momentum=0.9)
        exp_lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
        return [optimizer], [exp_lr_scheduler]

    def training_step(self, batch, batch_idx):
        # REQUIRED
        x, y = batch
        y_hat = self.forward(x)
        # NOTE: F.cross_entropy expects raw logits; since self.final ends in
        # Softmax, the loss here effectively applies softmax twice.
        loss = F.cross_entropy(y_hat, y)
        _, preds = torch.max(y_hat, dim=1)
        acc = torch.sum(preds == y.data) / (y.shape[0] * 1.0)
        # torch.tensor(acc)
        self.training_acc_across_batches_at_curr_epoch.append(acc.item())
        return {'loss': loss}  # , 'train_acc': acc}

    def on_epoch_end(self):
        train_acc_mean = np.mean(self.training_acc_across_batches_at_curr_epoch)
        self.logger.experiment.add_scalar('train_acc', train_acc_mean, global_step=self.current_epoch)
        self.training_acc_across_batches_at_curr_epoch = []  # reset for next epoch

    def validation_step(self, batch, batch_idx):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        _, preds = torch.max(y_hat, 1)
        acc = torch.sum(preds == y.data) / (y.shape[0] * 1.0)
        return {'val_loss': F.cross_entropy(y_hat, y), 'val_acc': acc}

    def validation_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        avg_acc = torch.stack([x['val_acc'].float() for x in outputs]).mean()
        print("val accuracy at end of epoch: ", avg_acc)
        logs = {'val_loss': avg_loss, 'val_acc': avg_acc}
        return {'avg_val_loss': avg_loss, 'avg_val_acc': avg_acc, 'log': logs, 'progress_bar': logs}

    def test_step(self, batch, batch_idx):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        _, preds = torch.max(y_hat, 1)
        acc = torch.sum(preds == y.data) / (y.shape[0] * 1.0)
        return {'test_loss': F.cross_entropy(y_hat, y), 'test_acc': acc}

    def test_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
        avg_acc = torch.stack([x['test_acc'].float() for x in outputs]).mean()
        logs = {'test_loss': avg_loss, 'test_acc': avg_acc}
        print('#### Average accuracy: ', avg_acc)
        return {'avg_test_loss': avg_loss, 'avg_test_acc': avg_acc, 'log': logs, 'progress_bar': logs}

    @pl.data_loader
    def train_dataloader(self):
        # REQUIRED
        transform = transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ])
        train_set = torchvision.datasets.ImageFolder(os.path.join(self.data_dir, 'train'), transform)
        train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)
        # train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
        # train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True, num_workers=2)
        return train_loader

    @pl.data_loader
    def val_dataloader(self):
        transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ])
        val_set = torchvision.datasets.ImageFolder(os.path.join(self.data_dir, 'val'), transform)
        val_loader = torch.utils.data.DataLoader(val_set, batch_size=32, shuffle=True, num_workers=4)
        # val_set = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
        # val_loader = torch.utils.data.DataLoader(val_set, batch_size=4, shuffle=False, num_workers=2)
        return val_loader

    @pl.data_loader
    def test_dataloader(self):
        # The test set and validation set are the same, so that I can check
        # that the validation and test accuracy are equal.
        transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ])
        val_set = torchvision.datasets.ImageFolder(os.path.join(self.data_dir, 'val'), transform)
        val_loader = torch.utils.data.DataLoader(val_set, batch_size=4, shuffle=True, num_workers=4)
        return val_loader
```
I've also put my code into a Colab, if you would like to run it: https://colab.research.google.com/drive/1JUQctrkFKMJSPU2u5061534HzF27WFRE (the relevant part is the first half of the notebook, up to 'For checking purposes'; the rest is messy and irrelevant, apologies).
Am I doing something wrong here that causes the accuracy to be different when I load the checkpoints separately, instead of calling `.test()` immediately after `.fit()`?
Change

```python
model_test.load_from_checkpoint('checkpoints/try_ckpt_epoch_1.ckpt')
```

to

```python
model_test = CoolSystem.load_from_checkpoint('checkpoints/try_ckpt_epoch_1.ckpt')
```

and it will work (99.9% sure).
Note that the load function is a classmethod. It will load exactly the same hyperparameters from the checkpoint that you used for training.
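To make the distinction concrete, here is a minimal sketch, reusing `CoolSystem`, `hparams`, and the checkpoint path from above:

```python
# load_from_checkpoint is a classmethod that builds and RETURNS a new model.
model_test = CoolSystem.load_from_checkpoint('checkpoints/try_ckpt_epoch_1.ckpt')

# Calling it on an instance does not restore that instance in place;
# the loaded model is returned and silently discarded here:
model_test = CoolSystem(hparams)
model_test.load_from_checkpoint('checkpoints/try_ckpt_epoch_1.ckpt')  # return value ignored
```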
Thanks, @awaelchli! I amended my code based on your suggestion and noticed some strange behavior: during `.fit()`, the val accuracy is 0.8620, but upon loading the checkpoints and calling `.test(model_test)`, the test accuracy is 0.8813. This is strange to me, as I'm using the same dataset for both val and test as a check.
Your notebook is quite big. It would be good if you could create a minimal example where this behavior can be observed; that makes it easier for us to help. The difference in accuracy you report is quite small. Could it simply be due to randomization of the data? To make sure this doesn't happen, make the code deterministic: https://pytorch.org/docs/stable/notes/randomness.html
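For reference, a minimal seeding sketch in the spirit of those notes (the exact set of flags varies by PyTorch version):

```python
import random

import numpy as np
import torch

# fix the common sources of randomness before building dataloaders and models
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```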
I'm loading my checkpoints for inference as per what @awaelchli suggested above:
```python
from argparse import Namespace

args = {
    'num_classes': 2,
    'epochs': 2,
    'data_dir': "/content/hymenoptera_data",
}
hyperparams_test = Namespace(**args)

model_test = CoolSystem.load_from_checkpoint('checkpoints/try_ckpt_epoch_1.ckpt')
trainer = pl.Trainer(weights_summary=None)
trainer.test(model_test)
```
However, I am now getting an error:

```
AttributeError: 'dict' object has no attribute 'data_dir'
```

Even with `model_test = CoolSystem(hyperparams_test).load_from_checkpoint('checkpoints/try_ckpt_epoch_1.ckpt')`, PyTorch Lightning still complains that `'dict' object has no attribute 'data_dir'`. Am I doing something wrong here?
I think the example shown in Checkpoint Loading only works if the user is loading the checkpoints immediately after training (i.e. the instance of `MyLightningModule` has already been instantiated with the necessary `hparams`). However, I'm loading the checkpoints separately (i.e. not immediately after training), and I had to change `__init__` within the module from
```python
class CoolSystem(pl.LightningModule):

    def __init__(self, hparams):
        super(CoolSystem, self).__init__()
        self.hparams = hparams
        self.data_dir = self.hparams.data_dir
        self.num_classes = self.hparams.num_classes
```

to

```python
class CoolSystem(pl.LightningModule):

    def __init__(self, hparams):
        super(CoolSystem, self).__init__()
        self.hparams = hparams
        self.data_dir = self.hparams['data_dir']
        self.num_classes = self.hparams['num_classes']
```

in order to get the hyperparameters to load properly.
Your checkpoint was saved with a dictionary, which means you likely gave the model a dictionary in `__init__` and not a `Namespace`.
@williamFalcon Could it be that this line is actually failing to convert the dictionary built by Lightning back to a namespace? In particular, I believe that is happening to me because my checkpoint has no value for "hparams_type", which means that `_convert_loaded_hparams` gets `None` as the second argument and returns the dictionary. In other words, the hparams in my checkpoint were indeed a namespace, but when trying to load the checkpoint, Lightning builds a dictionary, and it looks like the mechanism for changing it back to a namespace isn't working for me.
Possible workarounds:

- Setting `hparams_type = 'hparams_name'` tricks the conversion into returning an `AttributeDict` (which is like a `Namespace`). [This isn't a reasonable workaround, but I'm including it in case it helps to clarify what is going on.]
- Modifying `__init__` and explicitly converting to a `Namespace` if necessary:

  ```python
  if isinstance(hparams, dict):
      hparams = argparse.Namespace(**hparams)
  ```

(fwiw: I may be doing other things wrong, because my `__init__` doesn't get any of the other arguments during reconstruction that it is supposed to get.) :)
I think I am facing a similar issue. When I test a model that I load from a checkpoint file, the results are quite bad.
```python
class TransferLearningModel(LightningModule):

    def __init__(self, config):
        super().__init__()
        self.config = config
        # save config dict when saving model
        self.save_hyperparameters(config)
        ...
        self.hparams.learning_rate = self.config["optimizer_initial_lr"]
        ...


config = {
    "model_name": "resnext101_32x8d",
    "amp_optimization_level": "O1",
    ...
}

model = TransferLearningModel(config)
trainer = Trainer(...)
trainer.fit(model)

# testing
trainer.test(ckpt_path=filepath)
```
Any idea what I might be doing wrong?
I still face this problem in 0.8.5. I can't restore my LightningModule.
Even though I saved `self.hparams` as a `Namespace`, I found out Lightning loaded `hparams` back in as a dictionary; I discovered this by stepping through my debugger. Does this behavior still happen in 0.9+? I resorted to the workaround suggested above:

```python
if isinstance(hparams, dict):
    hparams = argparse.Namespace(**hparams)
```
No; as far as I know, we work around that by also saving the hparams type into the checkpoint, so we know when to convert it back to a namespace when loading.
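For anyone who wants to verify this on their own checkpoint, a quick inspection sketch (the path is a placeholder; the key names match those used by Lightning around 0.9 and may differ between versions):

```python
import torch

# inspect what hparams metadata a checkpoint actually carries
ckpt = torch.load('path/to/checkpoint.ckpt', map_location='cpu')
for key in ('hparams_name', 'hparams_type', 'hyper_parameters'):
    print(key, '->', ckpt.get(key, '<missing>'))
```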
I also have the same problem: when initializing training, a `Namespace` is provided, but a dict is saved in the checkpoint, and a dict is also loaded with `load_from_checkpoint`. Version 0.8.5.
I have verified once more that it works on the latest version. Please install 0.9.0rc12. If the problem persists, may I ask for a minimal code sample that reproduces the issue?
Hello @awaelchli! I'm facing the same issue on 0.9.0rc16: hparams during training is a `Namespace`, but it is passed in as a dict when loading from the checkpoint. I'll post a complete sample shortly, but as far as I can tell, my checkpoint doesn't have the key `hparams_type`:

```python
ckpt_path = <path to my checkpoint on my local system>
ckpt = torch.load(ckpt_path)
print(ckpt.keys())
```

yields the following:

```
dict_keys(['epoch', 'global_step', 'pytorch-lightning_version', 'checkpoint_callback_best_model_score', 'checkpoint_callback_best_model_path', 'optimizer_states', 'lr_schedulers', 'state_dict', 'hparams_name', 'hyper_parameters'])
```
For reference, the checkpoint was saved using automatic checkpointing, by specifying the following in `validation_step()`:

```python
loss = ...
result = pl.EvalResult(checkpoint_on=loss)
```
Did some snooping around in the source, and I think this is the relevant snippet in `trainer_io.py`:

```python
if model.hparams:
    if hasattr(model, '_hparams_name'):
        checkpoint[LightningModule.CHECKPOINT_HYPER_PARAMS_NAME] = model._hparams_name
    # add arguments to the checkpoint
    checkpoint[LightningModule.CHECKPOINT_HYPER_PARAMS_KEY] = model.hparams
    if OMEGACONF_AVAILABLE:
        if isinstance(model.hparams, Container):
            checkpoint[LightningModule.CHECKPOINT_HYPER_PARAMS_TYPE] = type(model.hparams)
```

I haven't been able to verify whether the `OMEGACONF_AVAILABLE` or the `isinstance` check is failing, but the hparams type didn't get saved to the checkpoint. Can we please reopen this issue? I can try and help figure out the issue further if needed.
Me too; I confirm I still get this on 0.9.0, just using an argparse `Namespace`. Please reopen.
0.9.0 still has this problem.
Here is my pretty dirty solution for loading the checkpoint left by the `ModelCheckpoint` callback:

```python
def weights_update(model, checkpoint):
    # keep only the checkpoint weights whose keys exist in the target model
    model_dict = model.state_dict()
    pretrained_dict = {k: v for k, v in checkpoint['state_dict'].items() if k in model_dict}
    model_dict.update(pretrained_dict)
    model.load_state_dict(model_dict)
    return model


model = weights_update(model=EfficientNet(...), checkpoint=torch.load(checkpoint_path))
```
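A slightly tidier variant of the same idea, assuming the freshly built model has the same architecture as the one in the checkpoint, is to load the raw state dict with `strict=False` (note this silently skips mismatched keys rather than filtering them explicitly):

```python
import torch


def load_weights_only(model: torch.nn.Module, checkpoint_path: str) -> torch.nn.Module:
    """Restore only the weights from a Lightning checkpoint, ignoring hparams."""
    ckpt = torch.load(checkpoint_path, map_location='cpu')
    model.load_state_dict(ckpt['state_dict'], strict=False)
    return model
```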
Still having this issue in 0.9.0. The issue comes from here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/b40de5464a953ff5866a255f4670d318bd8fd65a/pytorch_lightning/core/saving.py#L169

In fact, the hparams saved in the checkpoint are indeed a `Namespace`. However, `CHECKPOINT_HYPER_PARAMS_TYPE` was never saved, so this line of code cannot correctly change the type of `model_args`, and the user model is unexpectedly initialized with a dict.

I think this bug has to be fixed with very high priority, since loading trained models is one of the most basic things people need in any DL project.
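A hedged sketch of the kind of fix this analysis points to on the saving side: record the hparams type unconditionally instead of only inside the `OMEGACONF_AVAILABLE` branch. The function and key names here are illustrative only, not the actual patch:

```python
def dump_hparams(checkpoint: dict, hparams) -> None:
    # always save both the hparams contents and their container type, so the
    # loader can convert the stored dict back to e.g. argparse.Namespace
    checkpoint['hyper_parameters'] = vars(hparams) if hasattr(hparams, '__dict__') else dict(hparams)
    checkpoint['hparams_type'] = type(hparams).__name__  # e.g. 'Namespace'
```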
@magic282 that's great to hear; at least it's reproducible then. I think I also found that the type didn't seem to be saved in my checkpoint. Could you verify whether the `OMEGACONF_AVAILABLE` check or the `isinstance` check failed in your case, as I mentioned in my comment above? (quoting here for convenience)
> Did some snooping around in the source, and I think this is the relevant snippet in `trainer_io.py`:
>
> ```python
> if model.hparams:
>     if hasattr(model, '_hparams_name'):
>         checkpoint[LightningModule.CHECKPOINT_HYPER_PARAMS_NAME] = model._hparams_name
>     # add arguments to the checkpoint
>     checkpoint[LightningModule.CHECKPOINT_HYPER_PARAMS_KEY] = model.hparams
>     if OMEGACONF_AVAILABLE:
>         if isinstance(model.hparams, Container):
>             checkpoint[LightningModule.CHECKPOINT_HYPER_PARAMS_TYPE] = type(model.hparams)
> ```
>
> I haven't been able to verify whether the `OMEGACONF_AVAILABLE` or the `isinstance` check is failing, but the hparams type didn't get saved to the checkpoint. Can we please reopen this issue? I can try and help figure out the issue further if needed.
This issue is still there on 0.10.0.
could you try 1.0.0rc4?
looks like this still exists in 1.0.2
I am also facing the same problem; I made a post in the forum here. I am not sure whether my code is wrong or whether it is a bug.

Update: I have solved my problem; see the link above.
Could this have the "bug" label added? Seems like unintended behavior.
@andrewjong Ok, I'm adding the bug label, but I read the thread again, and I still cannot identify what the issue is here, sorry.
About the incorrect test results reported in this post by @polars05: I looked again at their Colab that was shared here, updated the syntax to the latest Lightning, and ran it: https://colab.research.google.com/drive/1HXAQCk0vrY2JKsnX2mQ2QxpsKB_zZy1z?usp=sharing

I get exactly the same test loss and test accuracy when loading the checkpoint and testing again.

Is the hparams loading issue you are facing here the same as in #3998? It looks like we have reproducible code for that one.
@awaelchli sorry for being unclear; the specific issue is the hparams bug in #3998, yes. Looks like that got added in the past two weeks; thanks for linking to it. We should probably focus on that one, as it's clearly about the hparams loading issue.
OK, thank you! Let's close this one, as I believe we have answered and resolved the main author's questions about how to load the checkpoint, and we also get the expected test values. I will get back to the linked issue and try to resolve it.
@awaelchli hey! I opened and diagnosed #3998, and if you follow the discussion there, I've linked to exactly why that's happening: the linked code parses the class spec to find ".hparams=", which fails for inherited classes. So in some sense I've already solved and demonstrated why it's happening. As such, that issue is related to `hparams_name`.

The current issue has nothing to do with it. If you look at my comment above in this thread, I've identified the issue as `hparams_type` missing from the checkpoint, and I've also linked to the code where this fails. I link this issue in the other one as well.

Please reopen this issue, since the other one is distinct. I've taken the time to dig in and isolate where both are happening, and they aren't the same. I understand it's hard to read entire comment threads, but it's also a little frustrating when issues get closed like this after trying to help isolate the problem :)
```python
import os
from argparse import Namespace

import torch
from torch.utils.data import Dataset

from pytorch_lightning import Trainer, LightningModule


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self, hparams):
        super().__init__()
        self.hparams = hparams
        self.layer = torch.nn.Linear(32, 2)
        print("type of hparams", type(self.hparams))
        print("class of hparams type", self.hparams.__class__.__name__)
        print("accessing hparams", self.hparams.something, "no problem")

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def step(self, x):
        x = self.layer(x)
        out = torch.nn.functional.mse_loss(x, torch.ones_like(x))
        return out

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def training_epoch_end(self, outputs) -> None:
        torch.stack([x["loss"] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"x": loss}

    def validation_epoch_end(self, outputs) -> None:
        torch.stack([x["x"] for x in outputs]).mean()

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"y": loss}

    def test_epoch_end(self, outputs) -> None:
        torch.stack([x["y"] for x in outputs]).mean()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]


def run_test():
    # fake data
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    test_data = torch.utils.data.DataLoader(RandomDataset(32, 64))

    # model
    model = BoringModel(hparams=Namespace(something=1))
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        max_epochs=1,
        weights_summary=None,
    )
    trainer.fit(model, train_data, val_data)

    # Try to load with hparams
    ckpt_path = trainer.checkpoint_callback.best_model_path
    model = BoringModel.load_from_checkpoint(ckpt_path)

    ckpt = torch.load(ckpt_path)
    print(ckpt.keys())
    print("hparams name saved to ckpt:", ckpt["hparams_name"])
    print("hparams contents saved to ckpt:", ckpt["hyper_parameters"])
    print("hparams type saved to ckpt:", type(ckpt["hyper_parameters"]))

    trainer.test(model, test_dataloaders=test_data)


if __name__ == '__main__':
    run_test()
```
I have once again tried to use the hparams, and it works. Please take this code and tell me what I need to modify to make it break. If there is an issue with hparams loading, it should be taken to a new issue. As far as I can tell, the OP's problem was solved, was it not? That is the reason I closed it.

> I understand it's hard to read entire comment threads, but it's also a little frustrating when issues get closed like this after trying to help isolate the problem :)

I'm sorry, but the frustration is also on my side. I have revisited this thread multiple times, I helped debug the OP's Colab, and later I converted it to the latest PL version and showed that the loaded values are the correct ones. Then this issue transformed into hparams loading, which I am simply not able to reproduce. Please, if it's not too much to ask, provide a minimal runnable example that breaks.
@awaelchli I didn't mean to invalidate your frustration; I think it's possible for both sides to be validly frustrated for different reasons. In my case, I tried pointing out that it's a problem with the saving rather than the loading, which got buried in the discussion, only to see the issue closed in reference to one that I opened and had already diagnosed as unrelated to this.
I see your point about why a separate issue might help, and created one here: #4333
The bug report code unfortunately doesn't work for me, but we can transfer the discussion there and you can close this again if it helps.