It was announced that generation is sped up by 80%, but where does the time saving come from?
That estimate was based on comparing generation times of the Lua version to the PyTorch version. We haven't done a detailed breakdown of the speed improvement yet, but some likely causes are:
1) The beam search code in fairseq-py is more optimized than in the Lua version. For example, the Lua version iterates over the candidate hypotheses to select the active hypotheses for the next step [1], while the PyTorch version uses batched Tensor operations to select the active hypotheses and only iterates over finalized hypotheses [2]; a toy sketch follows the links below.
2) There seem to be general speed improvements in PyTorch compared to Lua Torch, which result in faster inference times for this model.
[1] https://github.com/facebookresearch/fairseq/blob/b08530ae7332d4f8ca2d9ad470ea651fd5e22ba5/fairseq/search.lua#L215,L282
[2] https://github.com/facebookresearch/fairseq-py/blob/03c4a71698ad1f64f08f83196987f655b05ef181/fairseq/sequence_generator.py#L247-L291
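To make (1) concrete, here is a toy sketch of the batched-selection idea (illustrative shapes and names, not the actual fairseq-py code from [2]):

```python
import torch

# Toy sketch: pick the candidates for the next beam-search step with
# batched tensor ops instead of looping over hypotheses in Python.
bsz, beam_size, vocab = 2, 4, 10
cand_scores = torch.randn(bsz, beam_size * vocab)  # scores of every expansion

# One topk call selects the best candidates for all sentences at once.
scores, cand_idx = cand_scores.topk(beam_size, dim=1)
cand_beam = cand_idx // vocab   # which hypothesis each candidate extends
cand_token = cand_idx % vocab   # which token it appends

# Reorder the running hypotheses with a single gather, no per-hypothesis loop.
tokens = torch.zeros(bsz, beam_size, 5, dtype=torch.long)  # partial hypotheses
tokens = tokens.gather(1, cand_beam.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
```

The real implementation additionally masks out candidates that end in EOS and moves them to the finalized list, which is the only part that still runs in a Python loop.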
Thank you for your prompt response. I have another question, about optimization in the training process.
I trained on the NIST06 dataset (2.56 million training examples) with fairseq and fairseq-py, using exactly the same architecture, both on a P40 GPU. Their speeds are as follows:
| tool | total training time (hours) | total epochs | speed (hours/epoch) |
| -------- | :-----: | :----: | :----: |
| fairseq | 44.5 | 32 | 1.39 |
| fairseq-py | 36.1 | 44 | 0.82 |
While the per-epoch training speed is faster, the total number of epochs increases. Is this normal? And why does fairseq-py need more epochs? @myleott
First, are you sure you're using exactly the same configuration (e.g., model architecture, learning rate, norm clipping, etc.)?
If so, there are a few other differences between the Lua and PyTorch versions:
- … `--momentum 0` to PyTorch).

In addition to what Myle said, the increased number of epochs is probably due to the different annealing strategies used in the PyTorch and Lua versions. In Lua, once we hit a bad validation score we start forced annealing all the way down to the minimum learning rate. In PyTorch, we wait until we hit a bad validation score at every learning rate. That adds up to 10-15 epochs on small datasets. To make a fair comparison you can set the `--force-anneal` option in PyTorch and correspondingly `-forceanneal` in Lua.
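To make the difference concrete, here is a rough sketch of the two schedules as described above (illustrative code, not fairseq's actual implementation):

```python
def lua_style_lrs(lr, valid_losses, shrink=0.1, min_lr=1e-5):
    # Lua fairseq (as described above): after the first non-improving
    # validation score, keep annealing every epoch down to min_lr.
    annealing, best = False, float("inf")
    for loss in valid_losses:
        if loss >= best:
            annealing = True
        best = min(best, loss)
        if annealing:
            lr = max(lr * shrink, min_lr)
        yield lr

def pytorch_style_lrs(lr, valid_losses, shrink=0.1, min_lr=1e-5):
    # fairseq-py (as described above): shrink only when validation stops
    # improving, then keep training at the new rate until the next plateau.
    best = float("inf")
    for loss in valid_losses:
        if loss >= best:
            lr = max(lr * shrink, min_lr)
        best = min(best, loss)
        yield lr

losses = [4.0, 3.5, 3.6, 3.4, 3.45, 3.3]
print(list(lua_style_lrs(0.25, losses)))      # anneals every epoch after the first plateau
print(list(pytorch_style_lrs(0.25, losses)))  # anneals once per plateau
```

The second schedule spends extra epochs at each intermediate learning rate, which is where the additional 10-15 epochs come from.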
Thank you @myleott and @edunov, I've compared the two logs:
IN LUA:

```
fairseq train -datadir data-bin -savedir model -sourcelang zh -targetlang en -model fconv -nenclayer 12 -nlayer 12 -dropout 0.2 -optim nag -lr 0.25 -clip 0.1 -momentum 0.99 -timeavg -bptt 0 -nembed 512 -noutembed 512 -nhid 512
```

and

IN PYTHON:

```
Namespace(arch='fconv', clip_norm=0.1, data='data-bin', decoder_attention='True', decoder_embed_dim=512, decoder_layers='[(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3)]', decoder_out_embed_dim=256, dropout=0.2, encoder_embed_dim=512, encoder_layers='[(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3)]', force_anneal=0, label_smoothing=0, log_interval=1000, lr=0.25, lrshrink=0.1, max_epoch=0, max_positions=1024, max_tokens=6000, min_lr=1e-05, model='fconv', momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, restore_file='checkpoint_last.pt', sample_without_replacement=0, save_dir='model', save_interval=-1, seed=1, source_lang='zh', target_lang='en', test_subset='test', train_subset='train', valid_subset='valid', weight_decay=0.0, workers=1)
```
There seems to be no difference in _learning rate, clip norm, optimizer, momentum, or arch_,
while some other points still confuse me:

- `decoder_out_embed_dim=256` differs from `-noutembed 512`: my fault, I overlooked the default value of `decoder_out_embed_dim=256` (see the sketch at the end of this comment).
- Would `timeavg` and `bptt 0` make a difference?
- @myleott Does the generation scheme have a big impact if I don't tune the learning rate?
- @edunov Are there any experiments supporting the new annealing strategy in PyTorch? Also: `fairseq train --help | grep anneal` finds no `forceanneal`; do you mean `-annealing_type=fast/slow`?
Thank you.
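(Hypothetical snippet, not part of fairseq: dumping the parsed Namespace makes defaults such as `decoder_out_embed_dim=256` stand out when comparing against the Lua flags.)

```python
import argparse

# Illustrative only: print every setting, including defaults, so nothing
# is overlooked when diffing against the Lua command line.
args = argparse.Namespace(decoder_out_embed_dim=256, dropout=0.2, lr=0.25)
for key, value in sorted(vars(args).items()):
    print(f"{key} = {value}")
```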
- Would `timeavg` and `bptt 0` make a difference?
It shouldn't matter. `timeavg` just normalizes the gradient by the number of tokens, which is the default behavior in fairseq-py. And `bptt` is only relevant for RNNs, so it shouldn't matter for fconv models.
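For what it's worth, here is a toy illustration of that default (names and shapes are made up, not fairseq-py code): dividing the summed loss by the token count scales the gradient per token, which is what `timeavg` achieves in Lua.

```python
import torch
import torch.nn.functional as F

# Toy example: normalizing the loss by the number of target tokens
# gives a per-token gradient scale (the fairseq-py default behavior).
logits = torch.randn(8, 100, requires_grad=True)  # 8 target tokens, vocab of 100
targets = torch.randint(0, 100, (8,))
loss = F.cross_entropy(logits, targets, reduction="sum")
(loss / targets.numel()).backward()               # per-token gradient scale
```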
- Does the generation scheme have a big impact if I don't tune the learning rate?
Learning rate can affect convergence, but I wouldn't expect a huge difference from just the batching change.
- Are there any experiments supporting the new annealing strategy in PyTorch? Also: `fairseq train --help | grep anneal` finds no `forceanneal`; do you mean `-annealing_type=fast/slow`?
I'm not sure there's any reason to prefer one default annealing strategy over the other, but for fairseq-py we just reused the functionality built into PyTorch via `torch.optim.lr_scheduler` (whereas Lua Torch doesn't provide this functionality).
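For example, a plateau-based schedule along these lines can be built from `torch.optim.lr_scheduler.ReduceLROnPlateau` (a minimal sketch; the actual fairseq-py wiring may differ):

```python
import torch

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.25, momentum=0.99)
# Shrink the learning rate by 10x whenever the validation loss stops
# improving, down to a minimum of 1e-5.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=0, min_lr=1e-5)

for valid_loss in [4.0, 3.5, 3.6, 3.4, 3.45]:  # stand-in validation losses
    scheduler.step(valid_loss)
    print(optimizer.param_groups[0]["lr"])
```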
Re: the `-forceanneal` option, it seems it wasn't added to the Lua version. Here's the code to add it: https://github.com/myleott/fairseq/commit/3628305790b913abf57b17032376ef9eadbebd48