Pytorch-lightning: Validation loss in progress bar printed line by line

Created on 8 Oct 2019 · 14 Comments · Source: PyTorchLightning/pytorch-lightning

Common bugs: checked.

Describe the bug
When adding a "progress_bar" key to the validation_end output, the progress bar doesn't behave as expected and prints one line per iteration, e.g.:

80%|8| 3014/3750 [00:23<00:01, 516.63it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
82%|8| 3066/3750 [00:23<00:01, 517.40it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
83%|8| 3118/3750 [00:23<00:01, 516.65it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
85%|8| 3170/3750 [00:23<00:01, 517.42it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
86%|8| 3222/3750 [00:23<00:01, 517.59it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_
87%|8| 3274/3750 [00:23<00:00, 518.00it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
89%|8| 3326/3750 [00:23<00:00, 518.16it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
90%|9| 3378/3750 [00:23<00:00, 518.45it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_
91%|9| 3430/3750 [00:23<00:00, 518.36it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
93%|9| 3482/3750 [00:23<00:00, 518.02it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
94%|9| 3534/3750 [00:24<00:00, 517.26it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
96%|9| 3586/3750 [00:24<00:00, 517.68it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
97%|9| 3638/3750 [00:24<00:00, 518.08it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
98%|9| 3690/3750 [00:24<00:00, 518.18it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_
100%|9| 3742/3750 [00:24<00:00, 518.23it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_
100%|#| 3750/3750 [00:24<00:00, 518.23it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_loss=1.16]
save callback...
100%|#| 3750/3750 [00:24<00:00, 152.16it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_loss=1.16]

To Reproduce
Steps to reproduce the behavior:

  1. Take the minimal MNIST example from the docs (https://williamfalcon.github.io/pytorch-lightning/LightningModule/RequiredTrainerInterface/)
     with some code to run it:
    # assumes CoolModel is defined as in the linked docs
    import pytorch_lightning as pl

    if __name__ == "__main__":
        model = CoolModel()

        # most basic trainer, uses good defaults
        default_save_path = '/tmp/checkpoints/'
        trainer = pl.Trainer(default_save_path=default_save_path,
                             show_progress_bar=True)
        trainer.fit(model)
  2. Change the validation_end method to:
    def validation_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tqdm_dict = {'val_loss': avg_loss}

        return {
                'progress_bar': tqdm_dict,
                'log': {'val_loss': avg_loss},
        }
  3. Change training_step to:
    def training_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, y)
        output = {
            'loss': loss,  # required
            'progress_bar': {'training_loss': loss},  # optional (MUST ALL BE TENSORS)
        }
        return output
  4. Run the script and observe the broken progress bar output at validation time.

Note that both steps 2 and 3 are necessary to reproduce the issue; each on its own runs as expected.
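
For convenience, here is a self-contained sketch combining the steps above. It assumes the 0.5.x-era API used in this report (validation_end, batch_nb, @pl.data_loader, show_progress_bar), and the CoolModel internals are reconstructed from the linked docs rather than copied from this issue, so treat it as illustrative:

    import torch
    import torch.nn.functional as F
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import transforms
    from torchvision.datasets import MNIST
    import pytorch_lightning as pl

    class CoolModel(pl.LightningModule):
        def __init__(self):
            super(CoolModel, self).__init__()
            self.l1 = nn.Linear(28 * 28, 10)

        def forward(self, x):
            return torch.relu(self.l1(x.view(x.size(0), -1)))

        def training_step(self, batch, batch_nb):
            x, y = batch
            loss = F.cross_entropy(self.forward(x), y)
            # step 3: extra progress_bar metrics from training_step
            return {'loss': loss, 'progress_bar': {'training_loss': loss}}

        def validation_step(self, batch, batch_nb):
            x, y = batch
            return {'val_loss': F.cross_entropy(self.forward(x), y)}

        def validation_end(self, outputs):
            avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
            # step 2: progress_bar metrics from validation_end
            return {'progress_bar': {'val_loss': avg_loss},
                    'log': {'val_loss': avg_loss}}

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=0.02)

        @pl.data_loader
        def train_dataloader(self):
            return DataLoader(MNIST('/tmp/mnist', train=True, download=True,
                                    transform=transforms.ToTensor()), batch_size=32)

        @pl.data_loader
        def val_dataloader(self):
            return DataLoader(MNIST('/tmp/mnist', train=False, download=True,
                                    transform=transforms.ToTensor()), batch_size=32)

    if __name__ == "__main__":
        trainer = pl.Trainer(default_save_path='/tmp/checkpoints/',
                             show_progress_bar=True)
        trainer.fit(CoolModel())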

Expected behavior
A progress bar on a single line.

Desktop:

  • OS: Linux
  • Versions:
    pytorch-lightning==0.5.1.3
    torch==1.2.0

Additional context
I actually ran into this issue after trying to add EarlyStopping, which asked for val_loss; it turned out val_loss had to be exposed via the progress_bar metrics, which was quite unexpected to me (I would have expected it in "log" or as a direct key).
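
For future readers, a minimal sketch of the early-stopping setup in question, assuming the 0.5.x-era import path and Trainer argument (reconstructed, not copied from this report):

    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import EarlyStopping

    # EarlyStopping monitors 'val_loss'; at the time, that key was looked up
    # among the metrics returned via the progress_bar/log dictionaries.
    early_stop = EarlyStopping(monitor='val_loss', patience=3, mode='min')
    trainer = pl.Trainer(early_stop_callback=early_stop,
                         show_progress_bar=True)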

Labels: bug / fix, enhancement, help wanted

Most helpful comment

One option is to use from tqdm.auto import tqdm; this way it will use IPython widgets when in a notebook.
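
As a concrete illustration of that suggestion (a minimal sketch, not taken from the thread):

    import time
    from tqdm.auto import tqdm  # widget bar in Jupyter, plain text bar in a terminal

    for _ in tqdm(range(100), desc='demo'):
        time.sleep(0.01)  # stand-in for real work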

All 14 comments

I just checked whether it was down to the tqdm version by upgrading from tqdm==4.35.0 to tqdm==4.36.1, to no avail.

@annemariet thanks for finding this. Are you using this in a Jupyter notebook? That might be the issue.

But on a usability note, we'll move the early stopping to use keys not in progress_bar or log. Good point!

Hi, thanks for your quick reply. I'm running this from the command line.

@annemariet this can happen if you resize your terminal window during training. This is a tqdm bug, not a PL bug.

Re the early stopping, I just sent a fix yesterday: any keys NOT in "progress_bar" or "log" will be used for all callbacks. This is on master now.

I can reopen this if you are still having issues
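
To illustrate the fix described above, a hedged sketch of the post-fix behaviour on master, assuming top-level keys in the validation_end result are now forwarded to callbacks:

    def validation_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        return {
            'val_loss': avg_loss,                    # top-level: visible to callbacks such as EarlyStopping
            'progress_bar': {'val_loss': avg_loss},  # shown in the progress bar
            'log': {'val_loss': avg_loss},           # sent to the logger
        }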

This actually still happens in Spyder, but works fine in a terminal.

pytorch-lightning==0.5.3.2
spyder==3.3.6
spyder-kernels==0.5.2
tqdm==4.38.0
ipython==7.9.0

Sorry, although I searched for it, I had not seen that it was already discussed here. I think it's still an important open issue.

https://github.com/PyTorchLightning/pytorch-lightning/issues/721

I am using a Jupyter notebook and this happens there. Is there a fix for it in Jupyter? I like to develop there before moving to the command line.

@sudarshan85 it is an issue with tqdm, not Lightning; we cannot do much about it. Try upgrading...

@Borda nevertheless, we could think of a solution to disable the validation progress bar individually, or otherwise give some flexibility.

I'm curious whether something like fastprogress could be included in Lightning. There is also tqdm_notebook. I wonder whether this could be passed into Lightning for use as the progress bar.

One option is to use from tqdm.auto import tqdm; this way it will use IPython widgets when in a notebook.

Is it just that, importing another tqdm class? Would you consider making a PR?

Sure, it's PR #752. There will be some edge cases where someone is in a notebook environment but doesn't have their widgets set up. In that case they will get a warning message about what to do, and their progress bar won't show.

So we could potentially add a Trainer parameter to override this, which may help those people.

FYI: this is how tqdm does the notebook detection
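
That detection boils down to checking for a running IPython kernel; roughly (a paraphrased sketch, not tqdm's exact source):

    try:
        from IPython import get_ipython
        ipy = get_ipython()
        # a notebook kernel registers IPKernelApp in its config
        if ipy is None or 'IPKernelApp' not in ipy.config:
            raise ImportError("not running in a notebook kernel")
        from tqdm.notebook import tqdm  # widget-based bar
    except ImportError:
        from tqdm import tqdm  # plain text bar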

I will close this in favour of #765, so please let's continue the discussion there... :robot:
