At the moment the progress bar is initialized with the arg leave=False
: https://github.com/PyTorchLightning/pytorch-lightning/blob/deffbaba7ffb16ff57b56fe65f62df761f25fbd6/pytorch_lightning/trainer/trainer.py#L861
Sometimes, it's nice to be able to see the previous progress bar to look at the evolution of the loss and metrics.
Would it be possible to add an arg to the trainer to override the default tqdm parameters?
Also, another point: tqdm progress bars can be nested (https://github.com/tqdm/tqdm#nested-progress-bars). Could we imagine having a global progress bar and then a nested progress bar for each epoch loop?
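The nested layout can be sketched without any dependencies. Below is a stdlib-only illustration of what a global (epoch) bar stacked above a per-epoch (batch) bar would render; the `render_bar` / `render_nested` helpers are made up for this example, not tqdm or Lightning APIs:

```python
def render_bar(desc, current, total, width=10):
    """Render a single text progress bar, tqdm-style."""
    filled = int(width * current / total)
    pct = int(100 * current / total)
    return f"{desc}: {pct:3d}%|{'#' * filled}{'-' * (width - filled)}| {current}/{total}"

def render_nested(epoch, max_epochs, batch, num_batches):
    """Global bar (epochs) stacked above a nested per-epoch bar (batches)."""
    return "\n".join([
        render_bar("Epoch", epoch, max_epochs),
        render_bar("Batch", batch, num_batches),
    ])

print(render_nested(epoch=3, max_epochs=10, batch=250, num_batches=500))
# Epoch:  30%|###-------| 3/10
# Batch:  50%|#####-----| 250/500
```

With real tqdm, the same stacking is done by giving each bar a distinct `position` argument, as in the nested-bars example linked above.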
Another nice addition would be a global progress bar to have an ETA for the end of the whole training. Maybe a more general way to address this issue is to abstract the use of the progress bar in Trainer
(with a callback system for example), so people can extend and tweak progress bar usage as they need.
@hadim sounds interesting, do you have any particular implementation in mind?
Would you mind making a PR? =)
I think the progress bar should not be hardcoded in the trainer but abstracted in a callback. Once https://github.com/PyTorchLightning/pytorch-lightning/pull/776 is merged I could have a look if it's possible with the current API.
More generally the loggers should also be callbacks IMO. That being said it's easy to propose when you're not in charge :-)
I'll try to make a PR once #776 is merged.
@hadim are you still interested in implementing this progress bar?
I've made a custom progress bar as a callback and it works well for my needs. Not sure it will fit everyone's needs.
```python
from tqdm.auto import tqdm
import torch

from pytorch_lightning.callbacks import Callback


class ProgressBar(Callback):
    """Global progress bar.

    TODO: add progress bar for training, validation and testing loop.
    """

    def __init__(self, global_progress: bool = True, leave_global_progress: bool = True):
        super().__init__()

        self.global_progress = global_progress
        self.global_desc = "Epoch: {epoch}/{max_epoch}"
        self.leave_global_progress = leave_global_progress
        self.global_pb = None

    def on_fit_start(self, trainer, pl_module):
        desc = self.global_desc.format(epoch=trainer.current_epoch + 1, max_epoch=trainer.max_epochs)

        self.global_pb = tqdm(
            desc=desc,
            total=trainer.max_epochs,
            initial=trainer.current_epoch,
            leave=self.leave_global_progress,
            disable=not self.global_progress,
        )

    def on_fit_end(self, trainer, pl_module):
        self.global_pb.close()
        self.global_pb = None

    def on_epoch_end(self, trainer, pl_module):
        # Set description
        desc = self.global_desc.format(epoch=trainer.current_epoch + 1, max_epoch=trainer.max_epochs)
        self.global_pb.set_description(desc)

        # Set logs and metrics (`logs` is an attribute this particular LightningModule exposes)
        logs = pl_module.logs
        for k, v in logs.items():
            if isinstance(v, torch.Tensor):
                logs[k] = v.squeeze().item()
        self.global_pb.set_postfix(logs)

        # Update progress
        self.global_pb.update(1)
```
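To see when the hooks above fire during a fit, here is a dependency-free mock that replays the same `on_fit_start` / `on_epoch_end` / `on_fit_end` sequence. `FakeTrainer` and the text-only `GlobalProgressMock` are stand-ins for illustration, not Lightning APIs:

```python
class FakeTrainer:
    """Minimal stand-in for the trainer attributes the callback reads."""
    def __init__(self, max_epochs):
        self.max_epochs = max_epochs
        self.current_epoch = 0

class GlobalProgressMock:
    """Text-only mirror of the callback's hook logic (no tqdm)."""
    desc_template = "Epoch: {epoch}/{max_epoch}"

    def on_fit_start(self, trainer):
        self.history = []  # stands in for creating the tqdm bar

    def on_epoch_end(self, trainer):
        desc = self.desc_template.format(
            epoch=trainer.current_epoch + 1, max_epoch=trainer.max_epochs)
        self.history.append(desc)  # stands in for set_description + update(1)

    def on_fit_end(self, trainer):
        self.history.append("done")  # stands in for closing the bar

# Replay the hook sequence a Trainer.fit() would drive.
trainer = FakeTrainer(max_epochs=3)
cb = GlobalProgressMock()
cb.on_fit_start(trainer)
for epoch in range(trainer.max_epochs):
    trainer.current_epoch = epoch
    cb.on_epoch_end(trainer)
cb.on_fit_end(trainer)
print(cb.history)  # ['Epoch: 1/3', 'Epoch: 2/3', 'Epoch: 3/3', 'done']
```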
Only a global progress bar is implemented at the moment.
I could make a PR but some people might prefer the original one so I don't know if it's worth it.
Yeah, using a callback-driven progress bar looks like a much cleaner way than hardcoding the for loop wrapped in tqdm.
May I also add that I find the tqdm progress bar starts weirdly, with a percentage of 6% after just a single batch. Also, the progress bar shows a final value of 790, but if I calculate it by hand an epoch has either 528 or 1056 batches (either one pass, or one forward and one backward).
the bar shows the sum of train + val
Sorry, I do not follow. I was referring to the progress counter being off; after a single batch it shows:
Epoch 1: 6%|▌ | 50/790 [00:09<02:19, 5.29it/s, loss=3623526.000, training_loss=3.62e+6, v_num=0]
The batch size is 4, and none of my training, validation, or training+validation sets has 790 batches.
50/790 = 6%.
the progress bar updates in intervals of 50 batches. at batch 51 it will say 12%.
you can change that argument from 50 to 1 (bar refresh rate)
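The two numbers quoted in this thread (6% after one batch, 12% at batch 51) are consistent with a bar that advances in whole refresh-rate steps, i.e. the displayed count is the batch index rounded up to the next multiple of the refresh rate. A small sketch of that model (the helper names are made up for illustration):

```python
import math

def displayed_count(batch_idx, refresh_rate=50):
    """Count shown by a bar that advances in whole refresh_rate steps:
    batch_idx rounded up to the next multiple of refresh_rate."""
    return math.ceil(batch_idx / refresh_rate) * refresh_rate

def displayed_percent(batch_idx, total, refresh_rate=50):
    """Percentage the bar would display at a given batch."""
    return displayed_count(batch_idx, refresh_rate) * 100 // total

# Matches the observations in this thread (total = 790):
print(displayed_percent(1, 790))   # 6  -> "6% after a single batch"
print(displayed_percent(51, 790))  # 12 -> "at batch 51 it will say 12%"
```

With `refresh_rate=1` the displayed count equals the batch index, which is why lowering the refresh rate makes the bar track progress exactly.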
@hadim i think abstracting the current progress bar into a callback would be cool. then as you said, the user can modify it however they want by overriding parts of the callback.
> 50/790 = 6%. The progress bar updates in intervals of 50 batches; at batch 51 it will say 12%. You can change that argument from 50 to 1 (bar refresh rate).
Yes, but that jump to 50 happens after only 1 batch. Shouldn't it stay at 0 until batch no 50?
@williamFalcon: I agree this should be done in a callback. Not sure I'll have time to do that in the short term but anyone is free to use my code above.
A bit related question, should a progress bar look like below? It creates a "list of progress bars" when it switches to evaluation mode.
Epoch 9: 79%|████████ | 691/870 [00:07<00:01, 91.55batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Validating: 0%| | 0/179 [00:00<?, ?batch/s]
Epoch 9: 81%|█████████ | 708/870 [00:07<00:01, 105.89batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9: 83%|█████████ | 724/870 [00:07<00:01, 117.67batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9: 85%|█████████ | 741/870 [00:08<00:01, 128.86batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9: 87%|█████████ | 758/870 [00:08<00:00, 137.11batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9: 89%|█████████ | 775/870 [00:08<00:00, 145.04batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9: 91%|█████████ | 792/870 [00:08<00:00, 149.82batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9: 93%|██████████| 809/870 [00:08<00:00, 153.57batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9: 95%|██████████| 826/870 [00:08<00:00, 155.73batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9: 97%|██████████| 842/870 [00:08<00:00, 148.12batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9: 100%|██████████| 870/870 [00:08<00:00, 152.45batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
I've observed something similar in other projects and it is hard to pin down; sometimes it's caused by debug mode (e.g. in PyCharm)... but this is a tqdm-related thing, I think we can't do anything about it... :[
@hadim still willing to implement https://github.com/PyTorchLightning/pytorch-lightning/issues/765#issuecomment-593703168 ?
@danieltudosiu default was changed in #1100
@mateuszpieniak it is a tqdm issue, we cannot do much about it...
also, the tqdm default was changed in #749
Sorry @Borda but this is not a good moment for me to do that.
@awaelchli could you self-assign this one as well, as they are almost the same...
@Borda yes, could you assign me (can't self-assign) :)
The progress bar is now a callback #1450 . What remains is the question whether there should be an additional global progress bar (as suggested by @hadim) or if it is left to the user to extend such a feature.
@awaelchli I would assume this to be closed by #1450, and if we find we need something else we will add it later... anyway, feel free to reopen if we are (I am) missing something :rabbit:
> A bit related question, should a progress bar look like below? It creates a "list of progress bars" when it switches to evaluation mode.
Any suggestions on how to resolve this?
In which terminal emulator are you running this?
I often see this tqdm behavior in PyCharm and as far as I know we can't do anything about it. It's a tqdm issue.
I ran it on zsh and bash. tqdm==4.48.2, pytorch-lightning==1.0.0
I am seeing this behavior in jupyterlab as well:
Epoch 1: 54%|██████ | 4271/7859 [04:08<03:33, 16.81it/s, loss=0.545, v_num=0]
Epoch 1: 55%|██████ | 4287/7859 [04:08<03:31, 16.87it/s, loss=0.545, v_num=0]
Epoch 1: 55%|██████ | 4303/7859 [04:09<03:30, 16.89it/s, loss=0.545, v_num=0]
Validating: 7%|▌ | 258/3809 [00:02<01:12, 49.27it/s]
Epoch 1: 55%|██████ | 4319/7859 [04:10<03:29, 16.90it/s, loss=0.545, v_num=0]
Validating: 7%|▌ | 274/3809 [00:03<02:08, 27.59it/s]
Validating: 7%|▌ | 280/3809 [00:03<02:06, 27.90it/s]
Epoch 1: 55%|██████ | 4335/7859 [04:10<03:28, 16.92it/s, loss=0.545, v_num=0]
The progress bar seems to work well when testing, in trainer.test(model, dm); tuning the lr also shows a correct progress bar, but not when fitting. Any known fix for jupyterlab?
It's because of the stacking. Progress bar stacking has never worked well in Jupyter and Google Colab. As far as we know, it's a tqdm issue. Try running a stacked tqdm progress bar (without Lightning) in Jupyter and you will see the same.
In the method init_validation_tqdm, at line 289 of pytorch_lightning/callbacks/progress.py, there is leave=False. Shouldn't it be leave=True? It is True in the train and test init tqdm methods. Got the idea from here.
If we set it to leave=True, it will stay and fill up the terminal. But we want it to go away once validation is over because it's only a temporary bar that runs in parallel with the main bar. The main bar should stay always because it shows the epoch counter for the whole training.
Maybe I'm missing something. Appreciate you trying to look for the fix.
I ran the following code to test whether setting leave=True solved the problem (but it didn't):
```python
import sys

from tqdm.auto import tqdm
from pytorch_lightning.callbacks import ProgressBar  # import was missing in the original snippet


class LitProgressBar(ProgressBar):

    def init_validation_tqdm(self):
        """Override this to customize the tqdm bar for validation."""
        bar = tqdm(
            desc='Validating',
            position=(2 * self.process_position + 1),
            disable=self.is_disabled,
            leave=True,
            dynamic_ncols=True,
            file=sys.stdout,
        )
        return bar
```
I then ran my model with the custom callback, and after a few steps (~50% of an epoch) the screen was packed again with multiple printed lines :(
As a temporary fix I will disable the validation progress bar with a custom callback, at least when running with Jupyter. Thanks for the help!