Fairseq: max-positions vs tokens-per-sample

Created on 26 Nov 2018 · 4 comments · Source: pytorch/fairseq

Hi!

Thanks for your effort in creating this repository.

Could you please explain the difference between the parameters --max-positions and --tokens-per-sample in the context of training a language model? (Specifically, I am training transformer_lm on WikiText-103, as in your example.) Overall, I am quite uncertain how we go from a chunk of text to batches (in terms of sizes).

For example, if we assume that we have a chunk of text and we set the parameter --max-sentences, or --batch-size (I think it's the same thing for LMs), to 100, --max-tokens to 3500, and --tokens-per-sample to 35, then we should get batches with around 35 words per sample?

Please, help! :)

All 4 comments

I think they set the max_[*]_position to be equal to the tokens-per-sample here:
https://github.com/pytorch/fairseq/blob/master/fairseq/models/transformer.py#L223

The decoder later uses the max_[*]_position variable in its max_positions() function:
https://github.com/pytorch/fairseq/blob/master/fairseq/models/transformer.py#L510

I think your understanding is generally right. If --max-tokens is N and --tokens-per-sample is M, then the effective batch size is N/M (assuming you are not accumulating gradients and are running on a single GPU). Each sequence will have length M.
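
To make that concrete with the numbers from the question (a rough sketch; the flag values below are just the ones quoted above, not recommendations):

```python
# Back-of-the-envelope batch geometry in block mode, using the values from
# the question above (illustrative only, not recommended settings).
max_tokens = 3500        # --max-tokens: cap on tokens per batch (per GPU)
tokens_per_sample = 35   # --tokens-per-sample: length of each contiguous block

effective_batch_size = max_tokens // tokens_per_sample
print(effective_batch_size)  # -> 100 sequences of 35 tokens each per batch
```

As far as I know, if --max-sentences (--batch-size) is also set, whichever limit is reached first caps the batch.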

BTW, have you been able to train a good WT103 language model so far? What hyperparameters are you using, and how fast does your model converge?

Hi!

--tokens-per-sample controls the number of tokens in each training (and evaluation) example. WikiText-103 is normally trained in "block" mode (--sample-break-mode), which places n contiguous tokens into each example. In this case, n = tokens-per-sample.
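
Conceptually, block mode concatenates the whole corpus into one long token stream and slices it into fixed-size pieces, ignoring sentence boundaries. A minimal sketch of that idea (just the concept with made-up token ids, not fairseq's actual TokenBlockDataset):

```python
# Conceptual sketch of "block" mode: cut a long stream of token ids into
# contiguous blocks of tokens_per_sample tokens each. Illustration only;
# fairseq's real implementation handles more cases (e.g. the leftover tail).
def make_blocks(token_stream, tokens_per_sample):
    return [
        token_stream[start:start + tokens_per_sample]
        for start in range(0, len(token_stream) - tokens_per_sample + 1, tokens_per_sample)
    ]

stream = list(range(100))          # pretend token ids for the whole corpus
samples = make_blocks(stream, 35)  # two full blocks; the short tail is dropped here
print(len(samples), len(samples[0]))  # -> 2 35
```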

--max-positions in general is used to a) set the size of learned positional embeddings (if they are used instead of sinusoidal) and b) throw out examples that are longer than max-positions. When training LMs in block mode, this flag should probably not be set; it will default automatically to the value of --tokens-per-sample.
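
For reference, the defaulting behaviour described above boils down to something like the following (paraphrased; the exact attribute names in fairseq may differ):

```python
# Paraphrase of the defaulting described above, not the exact fairseq code:
# if --max-positions was not set explicitly, fall back to --tokens-per-sample.
def resolve_max_positions(max_positions, tokens_per_sample):
    return tokens_per_sample if max_positions is None else max_positions

print(resolve_max_positions(None, 35))   # -> 35 (defaults to tokens-per-sample)
print(resolve_max_positions(1024, 35))   # -> 1024 (an explicit value wins)
```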

Hope this helps

Thanks for your swift responses, they are really helpful!

@jerrybai1995 Right now my model converges to 38/45 ppl (train/valid) in 10 epochs. I am not sure whether that can be called a good LM, what do you think? I used mostly default params (from args, not from the README.md in examples/language-model), adding dropout 0.1, max-tokens 3500, tokens-per-sample 35, and batch-size 100.

Have you tried different parameters? What's the lowest perplexity you have obtained?

@chledowski Yes, I observed similar performance after 10 epochs, but without as much overfitting (I got 48/45). In my experience, setting tokens-per-sample to a somewhat larger value helps performance (though don't set it too large), but I haven't tried many settings either. I would probably increase the dropout rate a bit, given the generalization gap you are seeing.

Keep me posted on the perplexities! :-)
