Pytorch-lightning: Mismatch of displayed 'epoch'

Created on 9 Jan 2020  Β·  9Comments  Β·  Source: PyTorchLightning/pytorch-lightning

πŸ› Bug

The display of epoch's number mismatches between the progress bar and the checkpoint indicator. I wonder this mismatch could confuse users.

  • progress bar: The number of epochs starts from 1.
  • checkpoint indicator: The number of epochs starts from 0.
  • metrics.csv also starts from 0.

I think that to change checkpoint and metrics.csv causes a serious problem.
So progress bar should be changed in my opinion.

What do you think about it?

Epoch 32: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 331/331 [00:05<00:00, 88.73batch/s, batch_idx=17, loss=1.148, train_batch_loss=1.02, v_num=0, val_loss=1.05]
AINFO:root:
Epoch 00031: val_loss reached 1.04545 (best 1.04545), saving model to /dummy/version_0/checkpoints/_ckpt_epoch_31.ckpt as top 1
{'loss': 1.022357702255249, 'train_batch_loss': 1.022357702255249, 'val_loss': 1.0454469919204712}
Epoch 33:   5%|β–Œ         | 18/331 [00:00<00:05, 61.06batch/s, batch_idx=17, loss=1.073, train_batch_loss=1.31, v_num=0, val_loss=1.05]

Environment

  • PyTorch Version : 1.3.1
  • OS : macOS 10.14.6
  • How you installed PyTorch : pip install git+https://github.com/williamFalcon/pytorch-lightning.git@master --upgrade
  • Python version : 3.7.3
  • use CPU
bug / fix good first issue

All 9 comments

Indeed, the inconsistency of zero-based/one-based epoch is very confusing.

I think we should use zero-based only.

@matthew-z i agree it should be zero based. want to submit a PR?
@onkyo14taro @matthew-z

@williamFalcon I'll try this weekend.

@matthew-z @neggert @williamFalcon @Borda

I found a 'one epoch'-based API in GradientAccumulationScheduler while I was fixing the code.
If we force the API and display to be zero-based, it will break backward compatibility.
However, I think that mixing 0 and 1 will confuse users than changing the API, it should be zero based.
I think it would be better we warn with FutureWarning in version 0.6.x, and then change the API and display to be zero-based in version 0.7.x.
What do you think about it?

I can't quite wrap my head around this. If you make it zero based, then when the progress bar shows
Epoch 1: 50/100%
it will no longer mean that the first epoch is in progress, but actually it is the second epoch 50% completed? Is this really what you want? Why should it be this way?

@awaelchli

I agree with you, but I think zero-based would be better than one-based.

Zero-based implementations are:

  • Progress bar (_console display_)
  • GradientAccumulationScheduler (_console display, API_)
  • EarlyStopping (_console display_)

One-based implementations are:

  • metrics.csv (_in the file_)
  • _ckpt_epoch_{0-based epoch number}.ckpt (_filename_)
  • ModelCheckpoint (_console display_)

I think what of most influensive in these items are "metrics.csv" and "_ckpt_epoch_{0-based epoch number}.ckpt."
That's because we usually use the result file rather than console display.
So I think that zero-based implementations have less confusion than one-based when breaking backward compatibility in order to unify representation of epoch numbers.

agreed. let’s do zero-based

@williamFalcon Is it ok that FutureWarning is thrown when Progress bar, GradientAccumulationScheduler or EarlyStopping are used in version 0.6.x , and then the API and display renew in version 0.7.0?

we are using DeprecatedWarning :]

Was this page helpful?
0 / 5 - 0 ratings