Pytorch-lightning: Improve tqdm progress bar

Created on 29 Jan 2020 · 29 comments · Source: PyTorchLightning/pytorch-lightning

At the moment the progress bar is initialized with the arg leave=False: https://github.com/PyTorchLightning/pytorch-lightning/blob/deffbaba7ffb16ff57b56fe65f62df761f25fbd6/pytorch_lightning/trainer/trainer.py#L861

Sometimes, it's nice to be able to see the previous progress bar to look at the evolution of the loss and metrics.

Would that be possible to add an arg to the trainer to be able to override default tqdm parameters?

Also, another point: tqdm progress bars can be nested (https://github.com/tqdm/tqdm#nested-progress-bars). Could we imagine having a global progress bar and then a nested progress bar for each epoch loop?
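
For illustration, here is a minimal plain-tqdm sketch (not Lightning code) of the nested layout being proposed, with leave=True on the outer bar so finished bars stay visible:

from tqdm.auto import tqdm

# Outer bar tracks the whole training run, inner bar tracks a single epoch.
# leave=True keeps the outer bar on screen once it completes.
for epoch in tqdm(range(10), desc="Training", position=0, leave=True):
    for batch in tqdm(range(100), desc=f"Epoch {epoch}", position=1, leave=False):
        pass  # training/validation step would go here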

enhancement good first issue help wanted

Most helpful comment

A somewhat related question: should the progress bar look like the output below? It creates a "list of progress bars" when it switches to evaluation mode.

Epoch 9:  79%|███████▉  | 691/870 [00:07<00:01, 91.55batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Validating:   0%|          | 0/179 [00:00<?, ?batch/s]
Epoch 9:  81%|████████▏ | 708/870 [00:07<00:01, 105.89batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  83%|████████▎ | 724/870 [00:07<00:01, 117.67batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  85%|████████▌ | 741/870 [00:08<00:01, 128.86batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  87%|████████▋ | 758/870 [00:08<00:00, 137.11batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  89%|████████▉ | 775/870 [00:08<00:00, 145.04batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  91%|█████████ | 792/870 [00:08<00:00, 149.82batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  93%|█████████▎| 809/870 [00:08<00:00, 153.57batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  95%|█████████▍| 826/870 [00:08<00:00, 155.73batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  97%|█████████▋| 842/870 [00:08<00:00, 148.12batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9: 100%|██████████| 870/870 [00:08<00:00, 152.45batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]

All 29 comments

Another nice addition would be a global progress bar to have an ETA for the end of the whole training. Maybe a more general way to address this issue is to abstract the use of the progress bar in Trainer (with a callback system for example), so people can extend and tweak progress bar usage as they need.

@hadim sounds interesting, do you have any particular implementation in mind?
Would you mind making a PR? =)

I think the progress bar should not be hardcoded in the trainer but abstracted in a callback. Once https://github.com/PyTorchLightning/pytorch-lightning/pull/776 is merged I could have a look if it's possible with the current API.

More generally the loggers should also be callbacks IMO. That being said it's easy to propose when you're not in charge :-)

I'll try to make a PR once #776 is merged.

@hadim are you still interested in implementing this progress bar?

I've made a custom progress bar as a callback and it works well for my needs. Not sure it will fit everyone's needs.

from tqdm.auto import tqdm

import torch
from pytorch_lightning.callbacks import Callback


class ProgressBar(Callback):
    """Global progress bar.
    TODO: add progress bar for training, validation and testing loop.
    """

    def __init__(self, global_progress: bool = True, leave_global_progress: bool = True):
        super().__init__()

        self.global_progress = global_progress
        self.global_desc = "Epoch: {epoch}/{max_epoch}"
        self.leave_global_progress = leave_global_progress
        self.global_pb = None

    def on_fit_start(self, trainer, pl_module):
        desc = self.global_desc.format(epoch=trainer.current_epoch + 1, max_epoch=trainer.max_epochs)

        self.global_pb = tqdm(
            desc=desc,
            total=trainer.max_epochs,
            initial=trainer.current_epoch,
            leave=self.leave_global_progress,
            disable=not self.global_progress,
        )

    def on_fit_end(self, trainer, pl_module):
        self.global_pb.close()
        self.global_pb = None

    def on_epoch_end(self, trainer, pl_module):

        # Set description
        desc = self.global_desc.format(epoch=trainer.current_epoch + 1, max_epoch=trainer.max_epochs)
        self.global_pb.set_description(desc)

        # Set logs and metrics
        # (`logs` here is a custom attribute the LightningModule is expected to expose)
        logs = pl_module.logs
        for k, v in logs.items():
            if isinstance(v, torch.Tensor):
                logs[k] = v.squeeze().item()
        self.global_pb.set_postfix(logs)

        # Update progress
        self.global_pb.update(1)

Only a global progress bar is implemented at the moment.

I could make a PR but some people might prefer the original one so I don't know if it's worth it.
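
For anyone who wants to try the callback above, a rough usage sketch (assuming a Lightning version where the Trainer accepts a callbacks argument; model is a placeholder for your LightningModule):

from pytorch_lightning import Trainer

# Attach the custom global progress bar defined above.
trainer = Trainer(max_epochs=10, callbacks=[ProgressBar()])
trainer.fit(model)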

Yeah, using a callback-driven progress bar looks like a much cleaner way than checking the for loop wrapped by tqdm.

May I also add that I find the tqdm progress bar starts weirdly, with a percentage equal to 6% just after a single batch. Also, the progress bar shows a final value of 790, but if I calculate it by hand, an epoch has either 528 or 1056 batches (either one pass, or one forward plus one backward).

the bar shows the sum of train + val

the bar shows the sum of train + val

Sorry, I do not follow; I was referring to the progress counter being off. For example, after a single batch it shows:

Epoch 1: 6%|▋ | 50/790 [00:09<02:19, 5.29it/s, loss=3623526.000, training_loss=3.62e+6, v_num=0]0.0
The batch size is 4, and neither my training set, validation set, nor training+validation set has 790 batches.

50/790 = 6%.
the progress bar updates in intervals of 50 batches. at batch 51 it will say 12%.

you can change that argument from 50 to 1 (bar refresh rate)
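
For reference, a hedged example (assuming a Lightning version that exposes the progress_bar_refresh_rate argument on the Trainer):

from pytorch_lightning import Trainer

# Refresh the progress bar on every batch instead of every 50th batch.
trainer = Trainer(progress_bar_refresh_rate=1)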

@hadim I think abstracting the current progress bar into a callback would be cool. Then, as you said, the user can modify it however they want by overriding parts of the callback.

50/790 = 6%.
the progress bar updates in intervals of 50 batches. at batch 51 it will say 12%.

you can change that argument from 50 to 1 (bar refresh rate)

Yes, but that jump to 50 happens after only 1 batch. Shouldn't it stay at 0 until batch no 50?

@williamFalcon: I agree this should be done in a callback. Not sure I'll have time to do that in the short term but anyone is free to use my code above.

A somewhat related question: should the progress bar look like the output below? It creates a "list of progress bars" when it switches to evaluation mode.

Epoch 9:  79%|███████▉  | 691/870 [00:07<00:01, 91.55batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Validating:   0%|          | 0/179 [00:00<?, ?batch/s]
Epoch 9:  81%|████████▏ | 708/870 [00:07<00:01, 105.89batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  83%|████████▎ | 724/870 [00:07<00:01, 117.67batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  85%|████████▌ | 741/870 [00:08<00:01, 128.86batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  87%|████████▋ | 758/870 [00:08<00:00, 137.11batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  89%|████████▉ | 775/870 [00:08<00:00, 145.04batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  91%|█████████ | 792/870 [00:08<00:00, 149.82batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  93%|█████████▎| 809/870 [00:08<00:00, 153.57batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  95%|█████████▍| 826/870 [00:08<00:00, 155.73batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  97%|█████████▋| 842/870 [00:08<00:00, 148.12batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9: 100%|██████████| 870/870 [00:08<00:00, 152.45batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]

I was observing something similar in other projects and it is hard to determine the cause; sometimes it's caused by debug mode (e.g. in PyCharm)... but this is a tqdm-related thing, I think we can't do anything about it... :[

@hadim still willing to implement https://github.com/PyTorchLightning/pytorch-lightning/issues/765#issuecomment-593703168 ?
@danieltudosiu default was changed in #1100
@mateuszpieniak it is a tqdm issue, we cannot do much about it...
also, the tqdm default was changed in #749

Sorry @Borda but this is not a good moment for me to do that.

@awaelchli could you self-assign this one as well, since they are almost the same...

@Borda yes, could you assign me (can't self-assign) :)

The progress bar is now a callback #1450 . What remains is the question whether there should be an additional global progress bar (as suggested by @hadim) or if it is left to the user to extend such a feature.

@awaelchli I would assume this to be closed by #1450, and if we find we need something else, we will add it later... anyway, feel free to reopen if we are (I am) missing something :rabbit:

A somewhat related question: should the progress bar look like the output below? It creates a "list of progress bars" when it switches to evaluation mode.

Epoch 9:  79%|███████▉  | 691/870 [00:07<00:01, 91.55batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Validating:   0%|          | 0/179 [00:00<?, ?batch/s]
Epoch 9:  81%|████████▏ | 708/870 [00:07<00:01, 105.89batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  83%|████████▎ | 724/870 [00:07<00:01, 117.67batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  85%|████████▌ | 741/870 [00:08<00:01, 128.86batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  87%|████████▋ | 758/870 [00:08<00:00, 137.11batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  89%|████████▉ | 775/870 [00:08<00:00, 145.04batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  91%|█████████ | 792/870 [00:08<00:00, 149.82batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  93%|█████████▎| 809/870 [00:08<00:00, 153.57batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  95%|█████████▍| 826/870 [00:08<00:00, 155.73batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  97%|█████████▋| 842/870 [00:08<00:00, 148.12batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9: 100%|██████████| 870/870 [00:08<00:00, 152.45batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]

Any suggestions on how to resolve this?

In which terminal emulator are you running this?
I often see this tqdm behavior in PyCharm and as far as I know we can't do anything about it. It's a tqdm issue.

In which terminal emulator are you running this?
I often see this tqdm behavior in PyCharm and as far as I know we can't do anything about it. It's a tqdm issue.

I ran it on zsh and bash. tqdm==4.48.2, pytorch-lightning==1.0.0

I am seeing this behavior in jupyterlab as well:

Epoch 1:  54%|█████▍    | 4271/7859 [04:08<03:33, 16.81it/s, loss=0.545, v_num=0]
Epoch 1:  55%|█████▍    | 4287/7859 [04:08<03:31, 16.87it/s, loss=0.545, v_num=0]
Epoch 1:  55%|█████▍    | 4303/7859 [04:09<03:30, 16.89it/s, loss=0.545, v_num=0]
Validating:   7%|▋         | 258/3809 [00:02<01:12, 49.27it/s]
Epoch 1:  55%|█████▍    | 4319/7859 [04:10<03:29, 16.90it/s, loss=0.545, v_num=0]
Validating:   7%|▋         | 274/3809 [00:03<02:08, 27.59it/s]
Validating:   7%|▋         | 280/3809 [00:03<02:06, 27.90it/s]
Epoch 1:  55%|█████▌    | 4335/7859 [04:10<03:28, 16.92it/s, loss=0.545, v_num=0]

The progress bar seems to work well when testing with trainer.test(model, dm), and tuning the lr also shows a correct progress bar, but not when fitting. Any known fix for jupyterlab?

It's because of the stacking. Progress bar stacking has never worked well in Jupyter and Google Colab. As far as we know, it's a tqdm issue. Try running a stacked tqdm progress bar (without Lightning) in Jupyter and you will see the same.

In the method init_validation_tqdm, on line 289 of pytorch_lightning/callbacks/progress.py, there is leave=False. Shouldn't it be leave=True? It is True in the train and test init tqdm methods.

Got the idea from here.

If we set it to leave=True, it will stay and fill up the terminal. But we want it to go away once validation is over because it's only a temporary bar that runs in parallel with the main bar. The main bar should stay always because it shows the epoch counter for the whole training.

Maybe I'm missing something. Appreciate you trying to look for the fix.

I ran the following code to test if the setting leave=True solved the problem (but it didn't):

import sys

from tqdm.auto import tqdm
from pytorch_lightning.callbacks import ProgressBar

class LitProgressBar(ProgressBar):

    def init_validation_tqdm(self):
        """ Override this to customize the tqdm bar for validation. """
        bar = tqdm(
            desc='Validating',
            position=(2 * self.process_position + 1),
            disable=self.is_disabled,
            leave=True,
            dynamic_ncols=True,
            file=sys.stdout
        )
        return bar

I then ran my model with the custom callback, and after a few steps (~50% of the epoch) the screen was packed again with multiple printed lines :(

As a temporary fix I will disable the validation progress bar with a custom callback, at least when running with Jupyter. Thanks for the help!
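
In case it helps others, a rough sketch of that workaround (hypothetical class name, assuming the ProgressBar callback from pytorch_lightning.callbacks):

from tqdm import tqdm
from pytorch_lightning.callbacks import ProgressBar

class NoValidationBar(ProgressBar):
    """Keeps the main epoch bar but silences the validation bar (hypothetical helper)."""

    def init_validation_tqdm(self):
        # Build the validation bar fully disabled so it never prints extra lines.
        return tqdm(disable=True)

# Usage: trainer = Trainer(callbacks=[NoValidationBar()])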
