Ignite: Improve State and Engine docs

Created on 31 Aug 2020 · 12Comments · Source: pytorch/ignite

🚀 Feature

Currently, if we execute twice trainer.run we can fall into the following situation:

# ...
trainer.run(train_loader, max_epochs=5)

# reset model weights etc and rerun
trainer.run(train_loader, max_epochs=2)
> ValueError: Argument max_epochs should be larger than the start epoch defined in the state: 2 vs 5

But if we would like to restart the training from the begining, we have to reset trainer.state as

trainer.run(train_loader, max_epochs=5)

# reset model weights etc and rerun
trainer.state.max_epochs = None
trainer.run(train_loader, max_epochs=2)

Let's improve two things:
1) let's provide State.restart() method that just sets state.max_epochs = None
2) Add a note into Engine.run docstring saying if we would like to restart the training from zero, we have to either call trainer.state.restart() or trainer.state.max_epochs = None.
3) Improve the message "ValueError: Argument max_epochs should be larger than the start epoch defined in the state: 2 vs 5" by adding something like "Please, call state.restart() before calling engine.run() if would like to restart from zero."

For Hacktoberfest contributors, feel free to ask questions for details if any and say that you would like to tackle the issue.
Please, take a look at CONTRIBUTING guide.

Hacktoberfest enhancement good first issue help wanted

Source

vfdev-5

👍1

Most helpful comment

Just a remark, having a restart() method lets user call it when he wants, even if it's not the good time. Why do not have such feature in the trainer ? No way to misuse it. What do you think ?
@vfdev-5 @fco-dv

sdesrozis on 29 Sep 2020

👍2

All 12 comments

@vfdev-5 I could give some help on this one!

fco-dv on 28 Sep 2020

❤1

@fco-dv thanks, go ahead !

vfdev-5 on 28 Sep 2020

👍1

sdesrozis on 29 Sep 2020

👍2

@sdesrozis I agree that exposing such method can lead to a potential misuse and bugs. Adding Engine.restart method can have the same problems.
I let you discuss what could be an appropriate way to handle the actual problem.

vfdev-5 on 29 Sep 2020

@sdesrozis I agree, there's maybe a safer way to do that. When you said " even if it's not the good time " , can you figure out a use case ?

fco-dv on 29 Sep 2020

For instance

@trainer.on(Events.ITERATION_COMPLETED)
def _():
    trainer.state.restart()

By writing that, I realized that it could be a very nice feature to be able to restart or invalidate epochs if something wrong happened during learning. For example
epoch 10 started
=> save weights and others parameters
=> do something (maybe change lr, or what you want)
run several epochs
=> if something wrong, restart to epoch 10 and rerun with new parametrization

That is out of the scope but if the api is too restrictive (as I thought), no way to do that. So let's code the State.restart() and see.

Sorry for the noise...

sdesrozis on 29 Sep 2020

@sdesrozis I think calling trainer.state.restart() during the training can be dangerous and lead to unhandled/buggy situations. And it is a bit different from

@trainer.on(Events.ITERATION_COMPLETED)
def _():
    trainer.state.max_epochs = None

I think we shouldn't provide restart-like methods...

If we would like to devalidate some epoch etc. We can simply set the number as required (even it is very dynamic as approach):

@trainer.on(Events.EPOCH_COMPLETED)
def _():
    if some_condition():
        revert_n_epochs()
        trainer.state.epoch -= n

vfdev-5 on 29 Sep 2020

👍1

I think we shouldn't provide restart-like methods...

Ok that was my first felling too...

Maybe something like

trainer.run(train_loader, max_epochs=2, restart=True)

@vfdev-5 @fco-dv Thoughts ?

sdesrozis on 29 Sep 2020

@sdesrozis looks better for sure

fco-dv on 29 Sep 2020

I'm not fan of introducing new arg like restart ...

vfdev-5 on 29 Sep 2020

👍1

@vfdev-5 I understand.
@fco-dv Do you have a better idea ? Otherwise, let's implement trainer.state.restart() and we will see.

sdesrozis on 29 Sep 2020

@sdesrozis for the moment I don't

fco-dv on 29 Sep 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Accumulate output of .run()

andreydung · 4Comments

Label-wise metrics (Accuracy etc.) for multi-label problems

jphdotam · 4Comments

How do I attach both train and validation metrics to evaluator engine?

milongo · 3Comments

PyTorch dependency is lacking version constraint

sisp · 3Comments

Add mAP metric for object detection

TheCodez · 3Comments