It was announced that generation is sped up by 80%, but where does the time saving come from?
That estimate was based on comparing generation times of the Lua version to the PyTorch version. We haven't done a detailed breakdown of the speed improvement yet, but some likely causes are:
1) The beam search code in fairseq-py is more optimized than in the Lua version. For example, the Lua version iterates over the candidate hypotheses to select the active hypotheses for the next step [1], while the PyTorch version uses batched Tensor operations to select the active hypotheses and only iterates over finalized hypotheses [2]; a toy sketch follows the links below.
2) There seem to be general speed improvements in PyTorch compared to Lua Torch, which result in faster inference times for this model.
[1] https://github.com/facebookresearch/fairseq/blob/b08530ae7332d4f8ca2d9ad470ea651fd5e22ba5/fairseq/search.lua#L215,L282
[2] https://github.com/facebookresearch/fairseq-py/blob/03c4a71698ad1f64f08f83196987f655b05ef181/fairseq/sequence_generator.py#L247-L291
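To make (1) concrete, here is a toy sketch of the batched-selection idea (illustrative shapes and names, not the actual fairseq-py code from [2]):

```python
import torch

# Toy sketch: pick the candidates for the next beam-search step with
# batched tensor ops instead of looping over hypotheses in Python.
bsz, beam_size, vocab = 2, 4, 10
cand_scores = torch.randn(bsz, beam_size * vocab)  # scores of every expansion

# One topk call selects the best candidates for all sentences at once.
scores, cand_idx = cand_scores.topk(beam_size, dim=1)
cand_beam = cand_idx // vocab   # which hypothesis each candidate extends
cand_token = cand_idx % vocab   # which token it appends

# Reorder the running hypotheses with a single gather, no per-hypothesis loop.
tokens = torch.zeros(bsz, beam_size, 5, dtype=torch.long)  # partial hypotheses
tokens = tokens.gather(1, cand_beam.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
```

The real implementation additionally masks out candidates that end in EOS and moves them to the finalized list, which is the only part that still runs in a Python loop.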
Thank you for your prompt response. I have another question, about optimization in the training process.
I trained on the NIST06 dataset (2.56 million training examples) with fairseq and fairseq-py, using exactly the same architecture, both on a P40 GPU. Their speeds are as follows:
| tool | total training time (hours) | total epochs | speed (hours/epoch) |
| -------- | :-----: | :----: | :----: |
| fairseq | 44.5 | 32 | 1.39 |
| fairseq-py | 36.1 | 44 | 0.82 |
While the per-epoch training speed is faster, the total number of epochs increases. Is this normal? And why does fairseq-py need more epochs? @myleott
First, are you sure you're using exactly the same configuration (e.g., model architecture, learning rate, norm clipping, etc.)?
If so, there are a few other differences between the Lua and PyTorch versions:
- … `--momentum 0` to PyTorch).

In addition to what Myle said, the increased number of epochs is probably due to the different annealing strategies used in the PyTorch and Lua versions. In Lua, once we hit a bad validation score we start forced annealing all the way down to the minimum learning rate. In PyTorch, we wait until we hit a bad validation score at every learning rate. That adds up to 10-15 epochs on small datasets. To make a fair comparison you can set the `--force-anneal` option in PyTorch and correspondingly `-forceanneal` in Lua.
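To make the difference concrete, here is a rough sketch of the two schedules as described above (illustrative code, not fairseq's actual implementation):

```python
def lua_style_lrs(lr, valid_losses, shrink=0.1, min_lr=1e-5):
    # Lua fairseq (as described above): after the first non-improving
    # validation score, keep annealing every epoch down to min_lr.
    annealing, best = False, float("inf")
    for loss in valid_losses:
        if loss >= best:
            annealing = True
        best = min(best, loss)
        if annealing:
            lr = max(lr * shrink, min_lr)
        yield lr

def pytorch_style_lrs(lr, valid_losses, shrink=0.1, min_lr=1e-5):
    # fairseq-py (as described above): shrink only when validation stops
    # improving, then keep training at the new rate until the next plateau.
    best = float("inf")
    for loss in valid_losses:
        if loss >= best:
            lr = max(lr * shrink, min_lr)
        best = min(best, loss)
        yield lr

losses = [4.0, 3.5, 3.6, 3.4, 3.45, 3.3]
print(list(lua_style_lrs(0.25, losses)))      # anneals every epoch after the first plateau
print(list(pytorch_style_lrs(0.25, losses)))  # anneals once per plateau
```

The second schedule spends extra epochs at each intermediate learning rate, which is where the additional 10-15 epochs come from.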
Thank you @myleott and @edunov, I've compared the two logs:
IN LUA:

```
fairseq train -datadir data-bin -savedir model -sourcelang zh -targetlang en -model fconv -nenclayer 12 -nlayer 12 -dropout 0.2 -optim nag -lr 0.25 -clip 0.1 -momentum 0.99 -timeavg -bptt 0 -nembed 512 -noutembed 512 -nhid 512
```

and

IN PYTHON:

```
Namespace(arch='fconv', clip_norm=0.1, data='data-bin', decoder_attention='True', decoder_embed_dim=512, decoder_layers='[(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3)]', decoder_out_embed_dim=256, dropout=0.2, encoder_embed_dim=512, encoder_layers='[(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3),(512,3)]', force_anneal=0, label_smoothing=0, log_interval=1000, lr=0.25, lrshrink=0.1, max_epoch=0, max_positions=1024, max_tokens=6000, min_lr=1e-05, model='fconv', momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, restore_file='checkpoint_last.pt', sample_without_replacement=0, save_dir='model', save_interval=-1, seed=1, source_lang='zh', target_lang='en', test_subset='test', train_subset='train', valid_subset='valid', weight_decay=0.0, workers=1)
```
There seems to be no difference in _learning rate, clip norm, optimizer, momentum, or arch_,
while some other points still confuse me:

- `decoder_out_embed_dim=256` differs from `-noutembed 512`: my fault, I overlooked the default value of `decoder_out_embed_dim=256` (see the sketch at the end of this comment).
- Would `timeavg` and `bptt 0` make a difference?
- @myleott Does the generation scheme have a big impact if I don't tune the learning rate?
- @edunov Are there any experiments supporting the new annealing strategy in PyTorch? Also: `fairseq train --help | grep anneal` finds no `forceanneal`; do you mean `-annealing_type=fast/slow`?
Thank you.
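(Hypothetical snippet, not part of fairseq: dumping the parsed Namespace makes defaults such as `decoder_out_embed_dim=256` stand out when comparing against the Lua flags.)

```python
import argparse

# Illustrative only: print every setting, including defaults, so nothing
# is overlooked when diffing against the Lua command line.
args = argparse.Namespace(decoder_out_embed_dim=256, dropout=0.2, lr=0.25)
for key, value in sorted(vars(args).items()):
    print(f"{key} = {value}")
```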
- Would `timeavg` and `bptt 0` make a difference?
It shouldn't matter. `timeavg` just normalizes the gradient by the number of tokens, which is the default behavior in fairseq-py. And `bptt` is only relevant for RNNs, so it shouldn't matter for fconv models.
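For what it's worth, here is a toy illustration of that default (names and shapes are made up, not fairseq-py code): dividing the summed loss by the token count scales the gradient per token, which is what `timeavg` achieves in Lua.

```python
import torch
import torch.nn.functional as F

# Toy example: normalizing the loss by the number of target tokens
# gives a per-token gradient scale (the fairseq-py default behavior).
logits = torch.randn(8, 100, requires_grad=True)  # 8 target tokens, vocab of 100
targets = torch.randint(0, 100, (8,))
loss = F.cross_entropy(logits, targets, reduction="sum")
(loss / targets.numel()).backward()               # per-token gradient scale
```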
- Does the generation scheme have a big impact if I don't tune the learning rate?
Learning rate can affect convergence, but I wouldn't expect a huge difference from just the batching change.
- Are there any experiments supporting the new annealing strategy in PyTorch? Also: `fairseq train --help | grep anneal` finds no `forceanneal`; do you mean `-annealing_type=fast/slow`?
I'm not sure there's any reason to prefer one default annealing strategy over the other, but for fairseq-py we just reused the functionality built into PyTorch via `torch.optim.lr_scheduler` (whereas Lua Torch doesn't provide this functionality).
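For example, a plateau-based schedule along these lines can be built from `torch.optim.lr_scheduler.ReduceLROnPlateau` (a minimal sketch; the actual fairseq-py wiring may differ):

```python
import torch

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.25, momentum=0.99)
# Shrink the learning rate by 10x whenever the validation loss stops
# improving, down to a minimum of 1e-5.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=0, min_lr=1e-5)

for valid_loss in [4.0, 3.5, 3.6, 3.4, 3.45]:  # stand-in validation losses
    scheduler.step(valid_loss)
    print(optimizer.param_groups[0]["lr"])
```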
Re: the `-forceanneal` option, it seems it wasn't added to the Lua version. Here's the code to add it: https://github.com/myleott/fairseq/commit/3628305790b913abf57b17032376ef9eadbebd48