You can reproduce in 4 minutes on 0.9.0.
I tried master and got an unrelated wandb error and gave up trying to reproduce there.
You must be on a machine with multiple GPUs.
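A quick way to confirm that before starting (assumes PyTorch is already installed):
python -c "import torch; print(torch.cuda.device_count())"  # should print 2 or more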
git clone [email protected]:huggingface/transformers.git
cd transformers
pip install -e .
pip install -e .[examples] # installs pytorch-lightning==0.8.5
git checkout pl-checkpoint-bug
cd examples/seq2seq
wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz
export MAX_LEN=128
export m=sshleifer/student_marian_en_ro_6_3
python finetune.py \
--learning_rate=3e-4 \
--do_train \
--do_predict \
--fp16 \
--val_check_interval 0.25 \
--data_dir wmt_en_ro \
--max_source_length $MAX_LEN --max_target_length $MAX_LEN --val_max_target_length $MAX_LEN --test_max_target_length $MAX_LEN \
--freeze_encoder --freeze_embeds \
--train_batch_size=64 --eval_batch_size=64 \
--tokenizer_name $m --model_name_or_path $m \
--warmup_steps 500 --sortish_sampler --logger_name wandb \
--fp16_opt_level=O1 --task translation --num_sanity_val_steps=0 \
--gpus 8 --num_train_epochs=1 \
--output_dir dmar_pl_only_v3 --save_top_k=10
ls -l dmar_pl_only_v3/*.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:58 dmar_pl_only_v3/val_avg_bleu=23.3951-step_count=5.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:57 dmar_pl_only_v3/val_avg_bleu=23.2619-step_count=4.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:56 dmar_pl_only_v3/val_avg_bleu=22.6724-step_count=3.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:56 dmar_pl_only_v3/val_avg_bleu=22.2664-step_count=2.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:55 dmar_pl_only_v3/val_avg_bleu=23.2263-step_count=1.ckpt
There are 5 checkpoints with much lower scores. PL thinks the best checkpoint is from step 5, but:
cat dmar_pl_only_v3/metrics.json | grep bleu
"val_avg_bleu": 26.4513,
"val_avg_bleu": 25.5289,
"val_avg_bleu": 25.6942,
"val_avg_bleu": 26.2227,
"val_avg_bleu": 25.8546,
(the best checkpoint is actually from step 1).
When I evaluate the best checkpoint offline without truncation, I get val_bleu = 27+, which makes me nearly certain that the numbers in metrics.json (which I create and save in finetune.py) are correct and the numbers in the saved checkpoint paths are incorrect.
Is this a known issue with a workaround? How can I fix it? This should be high priority because suboptimal checkpoint saving is a huge productivity drain.
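To make the mismatch concrete, here is a small throwaway script (illustrative only; it assumes the filename pattern and the val_avg_bleu entries shown above) that prints the BLEU values recorded in the checkpoint filenames next to the ones in metrics.json:

import re
from pathlib import Path

out_dir = Path("dmar_pl_only_v3")

# BLEU values that ModelCheckpoint baked into the checkpoint filenames
ckpt_bleus = sorted(
    float(re.search(r"val_avg_bleu=([\d.]+)", p.name).group(1))
    for p in out_dir.glob("*.ckpt")
)

# BLEU values that finetune.py wrote to metrics.json (grep the raw text so we
# don't have to assume the file's exact schema)
json_bleus = sorted(
    float(v)
    for v in re.findall(r'"val_avg_bleu":\s*([\d.]+)', (out_dir / "metrics.json").read_text())
)

print("from checkpoint filenames:", ckpt_bleus)
print("from metrics.json:        ", json_bleus)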
@sshleifer I'm verifying this, but you might be right; I also had the feeling in one of my previous training runs that the epoch with the minimum validation loss wasn't the one saved.
Where in the PL code does trainer.callback_metrics gather data from all nodes?
CC: @justusschock @SkafteNicki
@sshleifer if you use our metrics, they have built-in DDP reduction (each metric is reduced separately)
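For context, a metric from the pytorch_lightning.metrics package keeps its state in buffers registered with add_state, and those buffers are reduced across DDP processes when compute() is called. A minimal sketch of the mechanism (based on the Metric API introduced around 1.0; this is not a real BLEU implementation, and averaging per-batch scores is only an approximation of corpus-level BLEU):

import torch
from pytorch_lightning.metrics import Metric  # torchmetrics.Metric in later releases


class AveragedBleu(Metric):
    """Running mean of per-batch BLEU scores; states are reduced across ranks."""

    def __init__(self, dist_sync_on_step=False):
        super().__init__(dist_sync_on_step=dist_sync_on_step)
        # states registered here are summed over all DDP processes at compute() time
        self.add_state("bleu_sum", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("count", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, batch_bleu: torch.Tensor):
        self.bleu_sum += batch_bleu
        self.count += 1

    def compute(self):
        return self.bleu_sum / self.count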
Do you have a working example of how that would work for ROUGE/BLEU?
@sshleifer I looked at my logs; my confusion was because I was looking at online fine-tuning loss values for pre-training. I'm checking this for BLEU now. BLEU still needs to be verified for DDP sync, which is something we are already working on.
OK, so in the current state of the code, is trainer.callback_metrics always from rank 0, from the last rank to finish, or from the first rank to finish?
@justusschock @ananyahjha93 just wanted to clarify that this has nothing to do with the metrics package, since this calculates the BLEU score using another package.
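Because the score comes from an external package rather than pl.metrics, one way to keep it consistent across ranks is to reduce the per-rank values manually before logging. A rough sketch of that idea (calculate_bleu and the surrounding hook are stand-ins for whatever finetune.py actually uses):

import torch
import torch.distributed as dist


def ddp_mean(value: float, device) -> float:
    """All-reduce a scalar so every rank (and the checkpoint callback) sees the same number."""
    stats = torch.tensor([value, 1.0], device=device)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(stats, op=dist.ReduceOp.SUM)  # sums the value and the rank count
    return (stats[0] / stats[1]).item()


# inside validation_epoch_end, roughly:
#   bleu = calculate_bleu(preds, targets)                  # per-rank value from the external package
#   self.log("val_avg_bleu", ddp_mean(bleu, self.device))  # now identical on every rank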
@sshleifer mind trying master? We added a lot of tests to model checkpointing and cleaned up the logging API.
1.0 will be released in a few days with these changes.
0.10.0rc1 is now available.
Let's see how it works with the new Metric API made by @ananyahjha93 and @teddykoker
@SeanNaren it may be solved, but it would be nice to add a test for this case to prevent it in the future.
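Something along these lines could serve as that regression test (a sketch in pytest style; BoringModel and RandomDataset are the helpers defined at the end of this thread, ModelCheckpoint argument names and filename templating vary a bit between PL versions, and the DDP variant would additionally set gpus=2 and accelerator='ddp' as in the test linked further down):

import os
import re

from torch.utils.data import DataLoader
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint


def test_best_ckpt_filename_matches_tracked_score(tmp_path):
    model = BoringModel()  # defined in the snippet at the end of this thread
    ckpt_cb = ModelCheckpoint(
        filepath=os.path.join(tmp_path, "{x:.4f}"),  # 'x' is the key logged in validation_step
        monitor="x",
        mode="min",
        save_top_k=1,
    )
    trainer = Trainer(
        default_root_dir=tmp_path,
        max_epochs=1,
        limit_train_batches=4,
        limit_val_batches=2,
        checkpoint_callback=ckpt_cb,
    )
    trainer.fit(model, DataLoader(RandomDataset(32, 64)), DataLoader(RandomDataset(32, 64)))

    # the value baked into the best checkpoint's filename should match the score the callback tracked
    value_in_name = float(re.search(r"([\d.]+)\.ckpt", ckpt_cb.best_model_path).group(1))
    assert abs(value_in_name - float(ckpt_cb.best_model_score)) < 1e-3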
I can confirm that on pytorch-lightning master and transformers master the metrics and checkpoints are in sync. I modified the command slightly by disabling wandb and fixing the data_dir path.
python finetune.py \
--learning_rate=3e-4 \
--do_train \
--fp16 \
--val_check_interval 0.25 \
--data_dir wmt_en_ro \
--max_source_length $MAX_LEN --max_target_length $MAX_LEN --val_max_target_length $MAX_LEN --test_max_target_length $MAX_LEN \
--freeze_encoder --freeze_embeds \
--train_batch_size=64 --eval_batch_size=64 \
--tokenizer_name $m --model_name_or_path $m \
--warmup_steps 500 --sortish_sampler \
--fp16_opt_level=O1 --task translation --num_sanity_val_steps=0 \
--gpus 8 --num_train_epochs=1 \
--output_dir dmar_pl_only_v3 --save_top_k=10
~/transformers/examples/seq2seq$ ls dmar_pl_only_v3/*.ckpt
'dmar_pl_only_v3/val_avg_bleu=21.3473-step_count=1.ckpt'
'dmar_pl_only_v3/val_avg_bleu=21.5114-step_count=2.ckpt'
'dmar_pl_only_v3/val_avg_bleu=23.1029-step_count=3.ckpt'
'dmar_pl_only_v3/val_avg_bleu=23.2499-step_count=4.ckpt'
cat dmar_pl_only_v3/metrics.json | grep bleu
"val_avg_bleu": 21.3473,
"val_avg_bleu": 21.51145,
"val_avg_bleu": 23.10295,
"val_avg_bleu": 23.24995,
Traceback (most recent call last):
  File "/home/jovyan/transformers/examples/seq2seq/finetune.py", line 440, in <module>
    main(args)
  File "/home/jovyan/transformers/examples/seq2seq/finetune.py", line 376, in main
    raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
ValueError: Output directory (dmar_pl_only_v3) already exists and is not empty.
As @Borda said I'll get a test in place to ensure the file path metrics are correct!
On pl=master in my multi-GPU unittest, I'm getting:
trainer.test()
ERR: File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 773, in __test_using_best_weights
ERR: 'ckpt_path is "best", but ModelCheckpoint is not configured to save the best model.'
ERR: pytorch_lightning.utilities.exceptions.MisconfigurationException: ckpt_path is "best", but ModelCheckpoint is not configured to save the best model.
There is 1 checkpoint in the save directory at the correct time.
Does ModelCheckpoint need to be configured differently?
Have you set a monitor key? Only with that will it be able to track the best checkpoint.
Yes, we removed the magical "val_loss" key for 1.0.
By default, without a checkpoint callback we always save the last checkpoint.
To change that, you now have to pass in a callback with the monitor keyword.
It makes things less magical and more transparent, and it also opens the door to multiple checkpoint callbacks soon.
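In other words, the 1.0-style wiring looks roughly like this ('my_metric' and model are placeholders; the monitored key has to be something the LightningModule actually logs):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# "my_metric" must be reported via self.log(...) from the LightningModule's
# validation hooks; without that call the callback has nothing to monitor.
checkpoint_callback = ModelCheckpoint(monitor="my_metric", mode="max", save_top_k=3)

trainer = Trainer(checkpoint_callback=checkpoint_callback, gpus=2, accelerator="ddp")
trainer.fit(model)              # model: your LightningModule, with dataloaders defined
trainer.test(ckpt_path="best")  # only works once the callback is tracking a monitored key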
Yes, I have a checkpoint callback with a monitor:
checkpoint_callback = ModelCheckpoint(
    filepath=os.path.join(output_dir, exp),
    monitor=f"val_{metric}",
    mode="min" if "loss" in metric else "max",
    save_top_k=save_top_k,
    period=0,  # interval (number of val checks) between checkpoints
)
It correctly saves checkpoints but trainer.test doesn't recognize it for some reason.
Do you run DDP? How do I reproduce it?
@sshleifer can you reproduce it using the BoringModel?
If you submit a failing test for this using the BoringModel, we can fix it in a few minutes.
I tried and gave up after 30 minutes.
I had to guess the install instructions (after scanning the README) and deal with a failing horovod install; I blindly ran pip install -r requirements/devel.txt. Then I tried adding to test_ddp and test_trainer, but my test was failing for reasons other than the bug I was trying to isolate. (I'm not whining, just telling you because I would want you to tell me if the roles were reversed.)
Is there a test that uses DDP, the BoringModel, and ModelCheckpoint, and calls trainer.test at the end?
If that exists and passes, I won't be able to replicate my issue.
If it doesn't exist yet, but you add it and it passes, you've increased the likelihood that my issue is my fault and improved your test coverage.
Yup! This test:
https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/backends/test_ddp.py#L44-L57
And, to be extra safe ;), it even makes sure it gets 90+ acc.
@sshleifer I can confirm this works using the below boring model test:
import os

import torch
from torch.utils.data import Dataset
from pytorch_lightning import Trainer, LightningModule
from pytorch_lightning.callbacks import ModelCheckpoint


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        """
        Testing PL Module

        Use as follows:
        - subclass
        - modify the behavior for what you want

        class TestModel(BaseTestModel):
            def training_step(...):
                # do your own thing

        or:

        model = BaseTestModel()
        model.training_epoch_end = None
        """
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def step(self, x):
        x = self.layer(x)
        out = torch.nn.functional.mse_loss(x, torch.ones_like(x))
        return out

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('loss', loss)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('x', loss)

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('y', loss)

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]


def run_test():
    # fake data
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    test_data = torch.utils.data.DataLoader(RandomDataset(32, 64))

    # model
    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=4,
        limit_val_batches=2,
        max_epochs=1,
        accelerator='ddp',
        gpus=2,
        checkpoint_callback=ModelCheckpoint(
            monitor='x',
            mode='min',
            save_top_k=1,
        ),
    )
    trainer.fit(model, train_data, val_data)
    # trainer.test defaults to ckpt_path="best"; pass a dataloader explicitly since the
    # model does not define test_dataloader itself
    trainer.test(test_dataloaders=test_data)


if __name__ == '__main__':
    run_test()
I think the error message is a bit ambiguous, but the issue might be that self.log isn't being called to report the metric we'd like to checkpoint on.
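Applied back to the seq2seq script, that would mean reporting the monitored key through self.log; a rough sketch of the idea (the per-batch "bleu" entries in outputs are an assumption here, and the real finetune.py aggregates its metrics differently):

import torch

# inside the LightningModule used by finetune.py (sketch)
def validation_epoch_end(self, outputs):
    # outputs: list of dicts returned by validation_step, each assumed to hold a "bleu" tensor
    val_avg_bleu = torch.stack([o["bleu"] for o in outputs]).mean()
    # self.log replaces the old {"log": {...}} return dict in the 1.0 API; sync_dist=True
    # averages the value across DDP ranks, so every process (and the ModelCheckpoint
    # monitoring "val_avg_bleu") sees the same number
    self.log("val_avg_bleu", val_avg_bleu, prog_bar=True, sync_dist=True)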