Pytorch-lightning: Progress bar \ log dict items added to outputs in training_epoch_end

Created on 4 May 2020 · 15 comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

When training_step returns something like this:

return {
    'loss': x,
    'some_value_for_epoch_end': y,
    'progress_bar': {'v1': z.mean()},
}

you get 'v1' as part of the outputs in training_epoch_end. This is unexpected, in my opinion.

And in the case of:

return {
    'loss': x,
    'some_value_for_epoch_end': y,
    'progress_bar': {'some_value_for_epoch_end': z.mean()},
}

you get 'some_value_for_epoch_end' == z.mean(), as the values get overwritten.

It originates from trainer/logging.py, lines 172–173, where the values are overwritten with progress_bar_outputs + log_outputs.
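
For context, the flattening there looks roughly like this (an illustrative sketch of the merge order with hypothetical names, not the verbatim source):

# Assumed shape of the merging logic (illustration only):
callback_metrics = dict(step_output_metrics)   # top-level keys from the training_step dict
callback_metrics.update(progress_bar_metrics)  # 'progress_bar' keys leak in, clobbering same-named keys
callback_metrics.update(log_metrics)           # 'log' keys can clobber them again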

bug / fix help wanted

Most helpful comment

hey awesome, I will review PR as soon as I have time! Thanks for your effort

All 15 comments

Hi! Thanks for your contribution, great first issue!

Isn't it a feature?
https://pytorch-lightning.readthedocs.io/en/0.7.5/experiment_reporting.html#modify-progress-bar

From what I see in the docs, I would expect 'v1': SomeValue to appear in the progress bar output.
Or do you mean you want to see only the value, without 'v1'?

Might be, but I view the progress bar and log dicts as something that goes to the "back end".

The "overwriting" property is a but I believe.
Lets say someone is trying to aggregate "acc" for the training_epoch_end phases, but he also like to report "acc" in the progress bar.

"acc1" is of size [batch_size, 1], while "acc2" is of size [1], just a scalar.

When arriving to training_epoch_end, he well see all outputs have only the [1] shaped outputs, instead of the [batch_size, 1] ones.
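
For concreteness, a minimal sketch of that scenario (hypothetical module, old 0.7-style training_step/training_epoch_end API):

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    # __init__, forward, configure_optimizers, dataloaders omitted for brevity

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        acc = (logits.argmax(dim=-1) == y).float().unsqueeze(-1)  # shape [batch_size, 1]
        return {
            'loss': loss,
            'acc': acc,                           # intended for aggregation at epoch end
            'progress_bar': {'acc': acc.mean()},  # scalar for display; clobbers 'acc' above
        }

    def training_epoch_end(self, outputs):
        # With the bug, every outputs[i]['acc'] is the scalar mean from the
        # progress bar dict, not the [batch_size, 1] tensor returned above.
        epoch_acc = torch.cat([o['acc'].view(-1) for o in outputs]).mean()
        return {'log': {'epoch_acc': epoch_acc}}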

Ah okay, now I understand. Do you have runnable example code or a notebook for that issue?
The logging class returns each dict, but also a complete callback_metrics dict. See:
https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/logging.py#L171-L179

So I think you can access all metric dicts via trainer.some_dict (progress_bar_metrics, callback_metrics). The question, I think, is whether any value gets overwritten and lost for good. Currently I don't think it does.
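
For reference, the per-destination dicts can be read from the trainer afterwards (these attribute names exist in 0.7.x):

pbar_metrics = trainer.progress_bar_metrics  # what the progress bar displays
cb_metrics = trainer.callback_metrics        # merged dict, where the overwrite happens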

EDIT:
I think returning the following dict would overwrite the value without any way to get it back:

{
    'loss': loss,
    'value': 1,
    'progress_bar': { 'value': 2}
}

Here callback_metrics['value'] == 2. The same happens when passing 'value' in the log dict. I don't know whether this case should be handled, but I think it could lead to bugs. Some ideas for how it could be solved:

  1. Do not collect log and progress bar metrics in callback_metrics (but the comments say it should...).
  2. Keep the log and progress_bar metrics in separate sub-dicts (like callback_metrics['progress_bar']['value']). I don't know whether this would lead to other errors; see the sketch after this list.
  3. Give them priorities and document it. E.g. the top-level callback metric would take first priority, then log, then progress_bar (as in {'value': 1, 'progress_bar': {'value': 3}, 'logs': {'value': 2}}).
  4. ?

I think this should be discussed more.
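
As mentioned in idea 2, a sketch of what that structure would look like (assumed layout, not an implemented fix):

# Idea 2: keep log and progress_bar metrics in their own sub-dicts
callback_metrics = {
    'value': 1,                    # top-level keys stay at the top level
    'progress_bar': {'value': 2},  # progress bar metrics live in a sub-dict
    'log': {'value': 3},           # log metrics live in their own sub-dict
}

# No key collides, so nothing is silently overwritten:
assert callback_metrics['value'] == 1
assert callback_metrics['progress_bar']['value'] == 2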

Hey here is an example:
https://colab.research.google.com/drive/1kYjEH2h5Ki9ygZ0XlA6hpbcxA1L1NNzb?usp=sharing

IMO the training outputs should contain only values that will be used at the epoch-end step, and not the progress_bar / log values, so that would be a valid solution. At least, I expected it to work this way and was quite surprised when it didn't.

@kessido Good example! This could be normal usage. I would wait for other opinions and ideas. Maybe I'll try solution 2 above in the next few days and check which tests fail or which classes are affected. But I think this could be a bit harder.

Another quick solution would be to prepend a prefix to the log and progress_bar dict keys. This wouldn't affect other classes, I think, but could also lead to collisions.

Let's see what others will say or what ideas they have.

I agree, the log and progress bar dicts should not end up in epoch_end. If one really wants to collect a metric from these sub-dicts at epoch end, then one should simply store a copy of it in the top-level dict.
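
A sketch of that workaround (hypothetical key name 'acc_for_epoch_end'):

# Inside training_step: keep a top-level copy of anything needed at epoch end
return {
    'loss': loss,
    'acc_for_epoch_end': acc,             # top-level copy survives into training_epoch_end
    'progress_bar': {'acc': acc.mean()},  # display-only scalar; no longer clobbers the copy
}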

Unrelated, but I noticed in your colab: you have an acc key in training_step and an acc key in epoch end for the logger. This is not good, since it will affect the global_step in the logger and increase it by 1 after every epoch. You should rename it to "epoch_acc" or something like that.

Haha, nice catch! 😄
I usually name them "train_x" or "val_x", but I just wrote this as an example :)

If no one else responds, I'll add a PR for that. The simplest solution I tested was to simply stop adding the log and progress bar dicts to the callback metrics. I had to change the base model's validation step to make the tests pass.
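
Roughly, the tested change amounts to this (an assumed sketch of the merge code shown earlier, not the actual diff):

callback_metrics = dict(step_output_metrics)   # keep only the top-level keys
# progress_bar and log metrics are no longer merged in,
# so nothing from the training_step output gets overwritten:
# callback_metrics.update(progress_bar_metrics)
# callback_metrics.update(log_metrics)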

hey awesome, I will review PR as soon as I have time! Thanks for your effort

@awaelchli are you preparing another PR than the one referenced in #1800?

No, if @olineumann's PR addresses the issue, it should not be necessary.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@williamFalcon is this fixed with structured results?

yes! fixed with structured results
