You can reproduce in 4 minutes on 0.9.0.
I tried master and got an unrelated wandb error and gave up trying to reproduce there.
You must be on a machine with multiple GPUs.
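A quick way to confirm that before starting (assumes PyTorch is already installed):
python -c "import torch; print(torch.cuda.device_count())"  # should print 2 or more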
git clone [email protected]:huggingface/transformers.git
cd transformers
pip install -e .
pip install -e .[examples] # installs pytorch-lightning==0.8.5
git checkout pl-checkpoint-bug
cd examples/seq2seq
wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz
export MAX_LEN=128
export m=sshleifer/student_marian_en_ro_6_3
python finetune.py \
--learning_rate=3e-4 \
--do_train \
--do_predict \
--fp16 \
--val_check_interval 0.25 \
--data_dir wmt_en_ro \
--max_source_length $MAX_LEN --max_target_length $MAX_LEN --val_max_target_length $MAX_LEN --test_max_target_length $MAX_LEN \
--freeze_encoder --freeze_embeds \
--train_batch_size=64 --eval_batch_size=64 \
--tokenizer_name $m --model_name_or_path $m \
--warmup_steps 500 --sortish_sampler --logger_name wandb \
--fp16_opt_level=O1 --task translation --num_sanity_val_steps=0 \
--gpus 8 --num_train_epochs=1 \
--output_dir dmar_pl_only_v3 --save_top_k=10
ls -l dmar_pl_only_v3/*.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:58 dmar_pl_only_v3/val_avg_bleu=23.3951-step_count=5.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:57 dmar_pl_only_v3/val_avg_bleu=23.2619-step_count=4.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:56 dmar_pl_only_v3/val_avg_bleu=22.6724-step_count=3.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:56 dmar_pl_only_v3/val_avg_bleu=22.2664-step_count=2.ckpt
-rw-r--r-- 1 shleifer shleifer 351351790 Sep 21 23:55 dmar_pl_only_v3/val_avg_bleu=23.2263-step_count=1.ckpt
There are 5 checkpoints with much lower scores. PL thinks the best checkpoint is from step 5, but:
cat dmar_pl_only_v3/metrics.json | grep bleu
"val_avg_bleu": 26.4513,
"val_avg_bleu": 25.5289,
"val_avg_bleu": 25.6942,
"val_avg_bleu": 26.2227,
"val_avg_bleu": 25.8546,
(the best checkpoint is actually from step 1).
When I evaluate the best checkpoint offline without truncation, I get val_bleu = 27+, which makes me nearly certain that the numbers in metrics.json (which I create and save in finetune.py) are correct and the numbers in the saved checkpoint paths are incorrect.
Is this a known issue with a workaround? How can I fix it? This should be high priority because suboptimal checkpoint saving is a huge productivity drain.
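To make the mismatch concrete, here is a small throwaway script (illustrative only; it assumes the filename pattern and the val_avg_bleu entries shown above) that prints the BLEU values recorded in the checkpoint filenames next to the ones in metrics.json:

import re
from pathlib import Path

out_dir = Path("dmar_pl_only_v3")

# BLEU values that ModelCheckpoint baked into the checkpoint filenames
ckpt_bleus = sorted(
    float(re.search(r"val_avg_bleu=([\d.]+)", p.name).group(1))
    for p in out_dir.glob("*.ckpt")
)

# BLEU values that finetune.py wrote to metrics.json (grep the raw text so we
# don't have to assume the file's exact schema)
json_bleus = sorted(
    float(v)
    for v in re.findall(r'"val_avg_bleu":\s*([\d.]+)', (out_dir / "metrics.json").read_text())
)

print("from checkpoint filenames:", ckpt_bleus)
print("from metrics.json:        ", json_bleus)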
@sshleifer I'm verifying this, but you might be right; I also had the feeling in one of my previous training runs that the epoch with the minimum validation loss wasn't the one saved.
Where in the PL code does trainer.callback_metrics gather data from all nodes?
CC: @justusschock @SkafteNicki
@sshleifer if you use our metrics, they have built-in DDP reduction (each metric is reduced separately)
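For context, a metric from the pytorch_lightning.metrics package keeps its state in buffers registered with add_state, and those buffers are reduced across DDP processes when compute() is called. A minimal sketch of the mechanism (based on the Metric API introduced around 1.0; this is not a real BLEU implementation, and averaging per-batch scores is only an approximation of corpus-level BLEU):

import torch
from pytorch_lightning.metrics import Metric  # torchmetrics.Metric in later releases


class AveragedBleu(Metric):
    """Running mean of per-batch BLEU scores; states are reduced across ranks."""

    def __init__(self, dist_sync_on_step=False):
        super().__init__(dist_sync_on_step=dist_sync_on_step)
        # states registered here are summed over all DDP processes at compute() time
        self.add_state("bleu_sum", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("count", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, batch_bleu: torch.Tensor):
        self.bleu_sum += batch_bleu
        self.count += 1

    def compute(self):
        return self.bleu_sum / self.count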
Do you have a working example of how that would work for ROUGE/BLEU?
@sshleifer I looked at my logs; my confusion was because I was looking at online fine-tuning loss values for pre-training. I'm checking this for BLEU now. BLEU still needs to be verified for DDP sync, which is something we are already working on.
OK, so in the current state of the code, is trainer.callback_metrics always from rank 0, from the last rank to finish, or from the first rank to finish?
@justusschock @ananyahjha93 just wanted to clarify that this has nothing to do with the metrics package, since this calculates the BLEU score using another package.
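Because the score comes from an external package rather than pl.metrics, one way to keep it consistent across ranks is to reduce the per-rank values manually before logging. A rough sketch of that idea (calculate_bleu and the surrounding hook are stand-ins for whatever finetune.py actually uses):

import torch
import torch.distributed as dist


def ddp_mean(value: float, device) -> float:
    """All-reduce a scalar so every rank (and the checkpoint callback) sees the same number."""
    stats = torch.tensor([value, 1.0], device=device)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(stats, op=dist.ReduceOp.SUM)  # sums the value and the rank count
    return (stats[0] / stats[1]).item()


# inside validation_epoch_end, roughly:
#   bleu = calculate_bleu(preds, targets)                  # per-rank value from the external package
#   self.log("val_avg_bleu", ddp_mean(bleu, self.device))  # now identical on every rank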
@sshleifer mind trying master? We added a lot of tests to model checkpointing and cleaned up the logging API.
1.0 will be released in a few days with these changes.
0.10.0rc1 is now available.
Let's see how it works with the new Metric API made by @ananyahjha93 and @teddykoker
@SeanNaren it may be solved, but it would be nice to add a test for this case to prevent it in the future.
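Something along these lines could serve as that regression test (a sketch in pytest style; BoringModel and RandomDataset are the helpers defined at the end of this thread, ModelCheckpoint argument names and filename templating vary a bit between PL versions, and the DDP variant would additionally set gpus=2 and accelerator='ddp' as in the test linked further down):

import os
import re

from torch.utils.data import DataLoader
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint


def test_best_ckpt_filename_matches_tracked_score(tmp_path):
    model = BoringModel()  # defined in the snippet at the end of this thread
    ckpt_cb = ModelCheckpoint(
        filepath=os.path.join(tmp_path, "{x:.4f}"),  # 'x' is the key logged in validation_step
        monitor="x",
        mode="min",
        save_top_k=1,
    )
    trainer = Trainer(
        default_root_dir=tmp_path,
        max_epochs=1,
        limit_train_batches=4,
        limit_val_batches=2,
        checkpoint_callback=ckpt_cb,
    )
    trainer.fit(model, DataLoader(RandomDataset(32, 64)), DataLoader(RandomDataset(32, 64)))

    # the value baked into the best checkpoint's filename should match the score the callback tracked
    value_in_name = float(re.search(r"([\d.]+)\.ckpt", ckpt_cb.best_model_path).group(1))
    assert abs(value_in_name - float(ckpt_cb.best_model_score)) < 1e-3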
I can confirm that on pytorch-lightning master and transformers master the metrics and checkpoints are in sync. I modified the command slightly by disabling wandb and fixing the data_dir path.
python finetune.py \
--learning_rate=3e-4 \
--do_train \
--fp16 \
--val_check_interval 0.25 \
--data_dir wmt_en_ro \
--max_source_length $MAX_LEN --max_target_length $MAX_LEN --val_max_target_length $MAX_LEN --test_max_target_length $MAX_LEN \
--freeze_encoder --freeze_embeds \
--train_batch_size=64 --eval_batch_size=64 \
--tokenizer_name $m --model_name_or_path $m \
--warmup_steps 500 --sortish_sampler \
--fp16_opt_level=O1 --task translation --num_sanity_val_steps=0 \
--gpus 8 --num_train_epochs=1 \
--output_dir dmar_pl_only_v3 --save_top_k=10
~/transformers/examples/seq2seq$ ls dmar_pl_only_v3/*.ckpt
'dmar_pl_only_v3/val_avg_bleu=21.3473-step_count=1.ckpt'
'dmar_pl_only_v3/val_avg_bleu=21.5114-step_count=2.ckpt'
'dmar_pl_only_v3/val_avg_bleu=23.1029-step_count=3.ckpt'
'dmar_pl_only_v3/val_avg_bleu=23.2499-step_count=4.ckpt'
cat dmar_pl_only_v3/metrics.json | grep bleu
"val_avg_bleu": 21.3473,
"val_avg_bleu": 21.51145,
"val_avg_bleu": 23.10295,
"val_avg_bleu": 23.24995,
Traceback (most recent call last):
  File "/home/jovyan/transformers/examples/seq2seq/finetune.py", line 440, in <module>
    main(args)
  File "/home/jovyan/transformers/examples/seq2seq/finetune.py", line 376, in main
    raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
ValueError: Output directory (dmar_pl_only_v3) already exists and is not empty.
As @Borda said I'll get a test in place to ensure the file path metrics are correct!
On pl=master in my multi-GPU unittest, I'm getting:
trainer.test()
ERR: File "/home/shleifer/miniconda3/envs/nb/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 773, in __test_using_best_weights
ERR: 'ckpt_path is "best", but ModelCheckpoint is not configured to save the best model.'
ERR: pytorch_lightning.utilities.exceptions.MisconfigurationException: ckpt_path is "best", but ModelCheckpoint is not configured to save the best model.
There is 1 checkpoint in the save directory at the correct time.
Does ModelCheckpoint need to be configured differently?
Have you set a monitor key? Only with that will it be able to track the best checkpoint.
Yes, we removed the magical "val_loss" key for 1.0.
By default, without a checkpoint callback we always save the last checkpoint.
To change that, you now have to pass in a callback with the monitor keyword.
It makes things less magical and more transparent, and it also opens the door to multiple checkpoint callbacks soon.
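In other words, the 1.0-style wiring looks roughly like this ('my_metric' and model are placeholders; the monitored key has to be something the LightningModule actually logs):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# "my_metric" must be reported via self.log(...) from the LightningModule's
# validation hooks; without that call the callback has nothing to monitor.
checkpoint_callback = ModelCheckpoint(monitor="my_metric", mode="max", save_top_k=3)

trainer = Trainer(checkpoint_callback=checkpoint_callback, gpus=2, accelerator="ddp")
trainer.fit(model)              # model: your LightningModule, with dataloaders defined
trainer.test(ckpt_path="best")  # only works once the callback is tracking a monitored key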
Yes, I have a checkpoint callback with a monitor:
checkpoint_callback = ModelCheckpoint(
    filepath=os.path.join(output_dir, exp),
    monitor=f"val_{metric}",
    mode="min" if "loss" in metric else "max",
    save_top_k=save_top_k,
    period=0,  # interval (number of val checks) between checkpoints
)
It correctly saves checkpoints but trainer.test doesn't recognize it for some reason.
Do you run DDP? How do I reproduce it?
@sshleifer can you reproduce it using the BoringModel?
If you submit a failing test for this using the BoringModel, we can fix it in a few minutes.
I tried and gave up after 30 minutes.
I had to guess the install instructions (after scanning the README) and deal with a failing horovod install; I blindly ran pip install -r requirements/devel.txt. Then I tried adding to test_ddp and test_trainer, but my test was failing for reasons other than the bug I was trying to isolate. (I'm not whining, just telling you because I would want you to tell me if the roles were reversed.)
Is there a test that uses DDP, the BoringModel, and ModelCheckpoint, and calls trainer.test at the end?
If that exists and passes, I won't be able to replicate my issue.
If it doesn't exist yet, but you add it and it passes, you've increased the likelihood that my issue is my fault and improved your test coverage.
Yup! This test:
https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/backends/test_ddp.py#L44-L57
And, to be extra safe ;), it even makes sure it gets 90+ acc.
@sshleifer I can confirm this works using the below boring model test:
import os

import torch
from torch.utils.data import Dataset
from pytorch_lightning import Trainer, LightningModule
from pytorch_lightning.callbacks import ModelCheckpoint


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        """
        Testing PL Module

        Use as follows:
        - subclass
        - modify the behavior for what you want

        class TestModel(BaseTestModel):
            def training_step(...):
                # do your own thing

        or:

        model = BaseTestModel()
        model.training_epoch_end = None
        """
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def step(self, x):
        x = self.layer(x)
        out = torch.nn.functional.mse_loss(x, torch.ones_like(x))
        return out

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('loss', loss)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('x', loss)

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        self.log('y', loss)

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]


def run_test():
    # fake data
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    test_data = torch.utils.data.DataLoader(RandomDataset(32, 64))

    # model
    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=4,
        limit_val_batches=2,
        max_epochs=1,
        accelerator='ddp',
        gpus=2,
        checkpoint_callback=ModelCheckpoint(
            monitor='x',
            mode='min',
            save_top_k=1,
        ),
    )
    trainer.fit(model, train_data, val_data)
    # trainer.test defaults to ckpt_path="best"; pass a dataloader explicitly since the
    # model does not define test_dataloader itself
    trainer.test(test_dataloaders=test_data)


if __name__ == '__main__':
    run_test()
I think the error message is a bit ambiguous, but the issue might be that self.log isn't being called to report the metric we'd like to checkpoint on.
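Applied back to the seq2seq script, that would mean reporting the monitored key through self.log; a rough sketch of the idea (the per-batch "bleu" entries in outputs are an assumption here, and the real finetune.py aggregates its metrics differently):

import torch

# inside the LightningModule used by finetune.py (sketch)
def validation_epoch_end(self, outputs):
    # outputs: list of dicts returned by validation_step, each assumed to hold a "bleu" tensor
    val_avg_bleu = torch.stack([o["bleu"] for o in outputs]).mean()
    # self.log replaces the old {"log": {...}} return dict in the 1.0 API; sync_dist=True
    # averages the value across DDP ranks, so every process (and the ModelCheckpoint
    # monitoring "val_avg_bleu") sees the same number
    self.log("val_avg_bleu", val_avg_bleu, prog_bar=True, sync_dist=True)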