Transformers: Need a restore-training mechanism in run_lm_finetuning.py

Created on 23 Nov 2019 · 3 comments · Source: huggingface/transformers

🚀 Feature

Motivation

When training with run_lm_finetuning.py for a long time, a resume-training feature should be added.

Otherwise, the scheduler and optimizer states are lost when training restarts.

For example, if training breaks at checkpoint-30000, it restarts from step 0 with the initial learning rate and other settings. This is really troublesome.

Thanks.

Additional context


All 3 comments

If you want to resume training with the same learning rate, you can save the scheduler and optimizer and reload them when resuming training.

For example, you could save the current training state with:

import torch

# Save the model and tokenizer
model.save_pretrained('./checkpoints/')
tokenizer.save_pretrained('./checkpoints/')

# Save the optimizer and scheduler
torch.save(optimizer.state_dict(), './checkpoints/optimizer.pt')
torch.save(scheduler.state_dict(), './checkpoints/scheduler.pt')

And resume training with:

import torch
from transformers import BertModel, BertTokenizer

# Initialize the model and tokenizer from the checkpoints dir
model = BertModel.from_pretrained('./checkpoints/')
tokenizer = BertTokenizer.from_pretrained('./checkpoints/')

# Recreate the optimizer and scheduler exactly as in the original training
# setup, then load their saved state
optimizer.load_state_dict(torch.load('./checkpoints/optimizer.pt'))
scheduler.load_state_dict(torch.load('./checkpoints/scheduler.pt'))

If you want more information, take a look at #839 and PyTorch's model serialization tutorial.

If you want to resume training at the exact epoch and batch where you left off, like this person, you could also save the epoch and batch number, then iterate through the dataloader without training until you reach the correct batch, as sketched below.
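For example, a minimal sketch of that bookkeeping (assuming the usual epoch, global_step, train_dataloader, and num_epochs variables from your training loop; the training_state.pt filename is just illustrative):

import torch

# When saving a checkpoint, also record how far training has progressed
torch.save({'epoch': epoch, 'global_step': global_step},
           './checkpoints/training_state.pt')

# When resuming, read those counters back ...
state = torch.load('./checkpoints/training_state.pt')
start_epoch = state['epoch']
steps_to_skip = state['global_step'] % len(train_dataloader)

# ... and fast-forward through already-seen batches before training again
for epoch in range(start_epoch, num_epochs):
    for step, batch in enumerate(train_dataloader):
        if epoch == start_epoch and step < steps_to_skip:
            continue  # skip batches that were already trained on
        # ... the usual forward / backward / optimizer.step() / scheduler.step() ...

Note that with a shuffled sampler the skipped batches will not be exactly the ones seen before the interruption unless the random seed or sampler state is also saved and restored.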

@bkkaggle Thanks for your reply, it really helps a lot!

Thank you!

@bkkaggle
However, the reasons I switched to PyTorch (Transformers by Hugging Face) are its ease of use and countless other advantages.

Why not add a universal feature that supports this smoothly, like TF checkpoints do?

I think saving checkpoints that way during training is the natural approach.

It sounds more troublesome to have users customize the checkpointing themselves, considering the high-level encapsulation the framework provides.
