Pytorch-lightning: Improve tqdm progress bar

Created on 29 Jan 2020 · 29 comments · Source: PyTorchLightning/pytorch-lightning

At the moment the progress bar is initialized with the arg leave=False: https://github.com/PyTorchLightning/pytorch-lightning/blob/deffbaba7ffb16ff57b56fe65f62df761f25fbd6/pytorch_lightning/trainer/trainer.py#L861

Sometimes, it's nice to be able to see the previous progress bar to look at the evolution of the loss and metrics.

Would that be possible to add an arg to the trainer to be able to override default tqdm parameters?

Also, another point: tqdm progress bars can be nested (https://github.com/tqdm/tqdm#nested-progress-bars). Could we imagine having a global progress bar and then a nested progress bar for each epoch loop?
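
For illustration, here is a minimal plain-tqdm sketch (not Lightning code) of the nested layout being proposed, with leave=True on the outer bar so finished bars stay visible:

from tqdm.auto import tqdm

# Outer bar tracks the whole training run, inner bar tracks a single epoch.
# leave=True keeps the outer bar on screen once it completes.
for epoch in tqdm(range(10), desc="Training", position=0, leave=True):
    for batch in tqdm(range(100), desc=f"Epoch {epoch}", position=1, leave=False):
        pass  # training/validation step would go here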

enhancement good first issue help wanted

Most helpful comment

A somewhat related question: should the progress bar look like the output below? It creates a "list of progress bars" when it switches to evaluation mode.

Epoch 9:  79%|███████▉  | 691/870 [00:07<00:01, 91.55batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Validating:   0%|          | 0/179 [00:00<?, ?batch/s]
Epoch 9:  81%|████████▏ | 708/870 [00:07<00:01, 105.89batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  83%|████████▎ | 724/870 [00:07<00:01, 117.67batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  85%|████████▌ | 741/870 [00:08<00:01, 128.86batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  87%|████████▋ | 758/870 [00:08<00:00, 137.11batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  89%|████████▉ | 775/870 [00:08<00:00, 145.04batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  91%|█████████ | 792/870 [00:08<00:00, 149.82batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  93%|█████████▎| 809/870 [00:08<00:00, 153.57batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  95%|█████████▍| 826/870 [00:08<00:00, 155.73batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  97%|█████████▋| 842/870 [00:08<00:00, 148.12batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9: 100%|██████████| 870/870 [00:08<00:00, 152.45batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]

All 29 comments

Another nice addition would be a global progress bar to have an ETA for the end of the whole training. Maybe a more general way to address this issue is to abstract the use of the progress bar in Trainer (with a callback system for example), so people can extend and tweak progress bar usage as they need.

@hadim sounds interesting, do you have any particular implementation in mind?
Would you mind making a PR? =)

I think the progress bar should not be hardcoded in the trainer but abstracted in a callback. Once https://github.com/PyTorchLightning/pytorch-lightning/pull/776 is merged I could have a look if it's possible with the current API.

More generally the loggers should also be callbacks IMO. That being said it's easy to propose when you're not in charge :-)

I'll try to make a PR once #776 is merged.

@hadim are you still interested in implementing this progress bar?

I've made a custom progress bar as a callback and it works well for my needs. Not sure it will fit everyone's needs.

from tqdm.auto import tqdm

import torch
from pytorch_lightning.callbacks import Callback


class ProgressBar(Callback):
    """Global progress bar.
    TODO: add progress bar for training, validation and testing loop.
    """

    def __init__(self, global_progress: bool = True, leave_global_progress: bool = True):
        super().__init__()

        self.global_progress = global_progress
        self.global_desc = "Epoch: {epoch}/{max_epoch}"
        self.leave_global_progress = leave_global_progress
        self.global_pb = None

    def on_fit_start(self, trainer, pl_module):
        desc = self.global_desc.format(epoch=trainer.current_epoch + 1, max_epoch=trainer.max_epochs)

        self.global_pb = tqdm(
            desc=desc,
            total=trainer.max_epochs,
            initial=trainer.current_epoch,
            leave=self.leave_global_progress,
            disable=not self.global_progress,
        )

    def on_fit_end(self, trainer, pl_module):
        self.global_pb.close()
        self.global_pb = None

    def on_epoch_end(self, trainer, pl_module):

        # Set description
        desc = self.global_desc.format(epoch=trainer.current_epoch + 1, max_epoch=trainer.max_epochs)
        self.global_pb.set_description(desc)

        # Set logs and metrics
        # (`logs` here is a custom attribute the LightningModule is expected to expose)
        logs = pl_module.logs
        for k, v in logs.items():
            if isinstance(v, torch.Tensor):
                logs[k] = v.squeeze().item()
        self.global_pb.set_postfix(logs)

        # Update progress
        self.global_pb.update(1)

Only a global progress bar is implemented at the moment.

I could make a PR but some people might prefer the original one so I don't know if it's worth it.
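
For anyone who wants to try the callback above, a rough usage sketch (assuming a Lightning version where the Trainer accepts a callbacks argument; model is a placeholder for your LightningModule):

from pytorch_lightning import Trainer

# Attach the custom global progress bar defined above.
trainer = Trainer(max_epochs=10, callbacks=[ProgressBar()])
trainer.fit(model)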

Yeah, using a callback-driven progress bar looks like a much cleaner way than checking the for loop wrapped by tqdm.

May I also add that I find the tqdm progress bar starts weirdly, with a percentage equal to 6% just after a single batch. Also, the progress bar shows a final value of 790, but if I calculate it by hand, an epoch has either 528 or 1056 batches (either one pass, or one forward plus one backward).

the bar shows the sum of train + val

the bar shows the sum of train + val

Sorry, I do not follow; I was referring to the progress counter being off. For example, after a single batch it shows:

Epoch 1: 6%|▋ | 50/790 [00:09<02:19, 5.29it/s, loss=3623526.000, training_loss=3.62e+6, v_num=0]0.0
The batch size is 4, and neither my training set, validation set, nor training+validation set has 790 batches.

50/790 = 6%.
the progress bar updates in intervals of 50 batches. at batch 51 it will say 12%.

you can change that argument from 50 to 1 (bar refresh rate)
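
For reference, a hedged example (assuming a Lightning version that exposes the progress_bar_refresh_rate argument on the Trainer):

from pytorch_lightning import Trainer

# Refresh the progress bar on every batch instead of every 50th batch.
trainer = Trainer(progress_bar_refresh_rate=1)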

@hadim I think abstracting the current progress bar into a callback would be cool. Then, as you said, the user can modify it however they want by overriding parts of the callback.

50/790 = 6%.
the progress bar updates in intervals of 50 batches. at batch 51 it will say 12%.

you can change that argument from 50 to 1 (bar refresh rate)

Yes, but that jump to 50 happens after only 1 batch. Shouldn't it stay at 0 until batch no 50?

@williamFalcon: I agree this should be done in a callback. Not sure I'll have time to do that in the short term but anyone is free to use my code above.

A somewhat related question: should the progress bar look like the output below? It creates a "list of progress bars" when it switches to evaluation mode.

Epoch 9:  79%|███████▉  | 691/870 [00:07<00:01, 91.55batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Validating:   0%|          | 0/179 [00:00<?, ?batch/s]
Epoch 9:  81%|████████▏ | 708/870 [00:07<00:01, 105.89batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  83%|████████▎ | 724/870 [00:07<00:01, 117.67batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  85%|████████▌ | 741/870 [00:08<00:01, 128.86batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  87%|████████▋ | 758/870 [00:08<00:00, 137.11batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  89%|████████▉ | 775/870 [00:08<00:00, 145.04batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  91%|█████████ | 792/870 [00:08<00:00, 149.82batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  93%|█████████▎| 809/870 [00:08<00:00, 153.57batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  95%|█████████▍| 826/870 [00:08<00:00, 155.73batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  97%|█████████▋| 842/870 [00:08<00:00, 148.12batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9: 100%|██████████| 870/870 [00:08<00:00, 152.45batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]

I was observing something similar in other projects and it is hard to determine the cause; sometimes it's caused by debug mode (e.g. in PyCharm)... but this is a tqdm-related thing, I think we can't do anything about it... :[

@hadim still willing to implement https://github.com/PyTorchLightning/pytorch-lightning/issues/765#issuecomment-593703168 ?
@danieltudosiu default was changed in #1100
@mateuszpieniak it is a tqdm issue, we cannot do much about it...
also, the tqdm default was changed in #749

Sorry @Borda but this is not a good moment for me to do that.

@awaelchli could you self-assign this one as well, since they are almost the same...

@Borda yes, could you assign me (can't self-assign) :)

The progress bar is now a callback #1450 . What remains is the question whether there should be an additional global progress bar (as suggested by @hadim) or if it is left to the user to extend such a feature.

@awaelchli I would assume this to be closed by #1450, and if we find we need something else, we will add it later... anyway, feel free to reopen if we are (I am) missing something :rabbit:

A somewhat related question: should the progress bar look like the output below? It creates a "list of progress bars" when it switches to evaluation mode.

Epoch 9:  79%|███████▉  | 691/870 [00:07<00:01, 91.55batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Validating:   0%|          | 0/179 [00:00<?, ?batch/s]
Epoch 9:  81%|████████▏ | 708/870 [00:07<00:01, 105.89batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  83%|████████▎ | 724/870 [00:07<00:01, 117.67batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  85%|████████▌ | 741/870 [00:08<00:01, 128.86batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  87%|████████▋ | 758/870 [00:08<00:00, 137.11batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  89%|████████▉ | 775/870 [00:08<00:00, 145.04batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  91%|█████████ | 792/870 [00:08<00:00, 149.82batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  93%|█████████▎| 809/870 [00:08<00:00, 153.57batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  95%|█████████▍| 826/870 [00:08<00:00, 155.73batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  97%|█████████▋| 842/870 [00:08<00:00, 148.12batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9: 100%|██████████| 870/870 [00:08<00:00, 152.45batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]

Any suggestions on how to resolve this?

In which terminal emulator are you running this?
I often see this tqdm behavior in PyCharm and as far as I know we can't do anything about it. It's a tqdm issue.

In which terminal emulator are you running this?
I often see this tqdm behavior in PyCharm and as far as I know we can't do anything about it. It's a tqdm issue.

I ran it on zsh and bash. tqdm==4.48.2, pytorch-lightning==1.0.0

I am seeing this behavior in jupyterlab as well:

Epoch 1:  54%|█████▍    | 4271/7859 [04:08<03:33, 16.81it/s, loss=0.545, v_num=0]
Epoch 1:  55%|█████▍    | 4287/7859 [04:08<03:31, 16.87it/s, loss=0.545, v_num=0]
Epoch 1:  55%|█████▍    | 4303/7859 [04:09<03:30, 16.89it/s, loss=0.545, v_num=0]
Validating:   7%|▋         | 258/3809 [00:02<01:12, 49.27it/s]
Epoch 1:  55%|█████▍    | 4319/7859 [04:10<03:29, 16.90it/s, loss=0.545, v_num=0]
Validating:   7%|▋         | 274/3809 [00:03<02:08, 27.59it/s]
Validating:   7%|▋         | 280/3809 [00:03<02:06, 27.90it/s]
Epoch 1:  55%|█████▌    | 4335/7859 [04:10<03:28, 16.92it/s, loss=0.545, v_num=0]

The progress bar seems to work well when testing with trainer.test(model, dm), and tuning the lr also shows a correct progress bar, but not when fitting. Any known fix for jupyterlab?

It's because of the stacking. Progress bar stacking has never worked well in Jupyter and Google Colab. As far as we know, it's a tqdm issue. Try running a stacked tqdm progress bar (without Lightning) in Jupyter and you will see the same.

In the method init_validation_tqdm, on line 289 of pytorch_lightning/callbacks/progress.py, there is leave=False. Shouldn't it be leave=True? It is True in the train and test init tqdm methods.

Got the idea from here.

If we set it to leave=True, it will stay and fill up the terminal. But we want it to go away once validation is over because it's only a temporary bar that runs in parallel with the main bar. The main bar should stay always because it shows the epoch counter for the whole training.

Maybe I'm missing something. Appreciate you trying to look for the fix.

I ran the following code to test if the setting leave=True solved the problem (but it didn't):

import sys

from tqdm.auto import tqdm
from pytorch_lightning.callbacks import ProgressBar

class LitProgressBar(ProgressBar):

    def init_validation_tqdm(self):
        """ Override this to customize the tqdm bar for validation. """
        bar = tqdm(
            desc='Validating',
            position=(2 * self.process_position + 1),
            disable=self.is_disabled,
            leave=True,
            dynamic_ncols=True,
            file=sys.stdout
        )
        return bar

I then ran my model with the custom callback, and after a few steps (~50% of the epoch) the screen was packed again with multiple printed lines :(

As a temporary fix I will disable the validation progress bar with a custom callback, at least when running with Jupyter. Thanks for the help!
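
In case it helps others, a rough sketch of that workaround (hypothetical class name, assuming the ProgressBar callback from pytorch_lightning.callbacks):

from tqdm import tqdm
from pytorch_lightning.callbacks import ProgressBar

class NoValidationBar(ProgressBar):
    """Keeps the main epoch bar but silences the validation bar (hypothetical helper)."""

    def init_validation_tqdm(self):
        # Build the validation bar fully disabled so it never prints extra lines.
        return tqdm(disable=True)

# Usage: trainer = Trainer(callbacks=[NoValidationBar()])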
