I have a few questions related to the Wiki103 pretrained model and the provided training script.
1) In the training script you have
--max-tokens 3072 --tokens-per-sample 3072
However in the paper, you state that
For WIKITEXT-103 we partition the training data into blocks of 512 contiguous tokens
I'm wondering where/how this happens in the provided training example, or whether the training example doesn't match the paper. In general, I am confused about how batch size is determined in the fairseq framework. Running the code below with the provided wiki103 command-line args gives src_tokens with size [1, 3072].
2) For multiple GPUs, are --max-tokens and --tokens-per-sample per GPU, or do they get split across GPUs?
3) Loading the model, the saved args have the arch as 'transformer_lm_gbw' and not 'transformer_lm_wiki103'. Why is this?
from fairseq.tasks.language_modeling import LanguageModelingTask

# Build the LM task and load the requested split.
reg_task = LanguageModelingTask.setup_task(args)
reg_task.load_dataset(split)

# Batches are sized by token count, not by sentence count.
reg_iter = reg_task.get_batch_iterator(
    reg_task.datasets[split],
    max_tokens=args.max_tokens,
    max_sentences=args.max_sentences,
    max_positions=args.max_target_positions,
)
reg_e_iter = reg_iter.next_epoch_itr(shuffle=True)
for sample in reg_e_iter:
    print(sample, sample['id'].shape, 'id shape')
    print(sample['net_input']['src_tokens'].shape)  # [1, 3072] with the wiki103 args
However in the paper, you state that (...)
See Section 5.1 of the paper: "Table 2 shows our result on WIKITEXT-103 where adaptive inputs achieve 18.7 perplexity. For this result only, we partition the training data into blocks of 3072 contiguous tokens instead of 512 tokens as for other experiments." I believe this is the model that was released.
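For intuition, it's --tokens-per-sample that sets this block length: the training corpus is concatenated into one long token stream and cut into contiguous fixed-size chunks (in fairseq, TokenBlockDataset does the real work). Here is a minimal sketch of that partitioning; chunk_into_blocks is a hypothetical helper for illustration, not fairseq's actual code.

import torch

def chunk_into_blocks(token_stream, tokens_per_sample):
    # Hypothetical illustration of what --tokens-per-sample controls;
    # drops the ragged tail that doesn't fill a whole block.
    n_blocks = len(token_stream) // tokens_per_sample
    return token_stream[:n_blocks * tokens_per_sample].view(n_blocks, tokens_per_sample)

stream = torch.randint(0, 50000, (10_000_000,))  # stand-in token stream
blocks = chunk_into_blocks(stream, 3072)
print(blocks.shape)  # torch.Size([3255, 3072])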
For multiple GPUs, are --max-tokens and --tokens-per-sample per GPU, or do they get split across GPUs?
--max-tokens and --tokens-per-sample are per GPU. So if you have two GPUs then you'll effectively have double the max tokens.
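So the effective number of tokens per optimizer update scales with the GPU count, and also with --update-freq if you use gradient accumulation. A quick back-of-the-envelope sketch (the helper name is made up, assuming those are the only multiplying factors):

def effective_tokens_per_update(max_tokens, num_gpus, update_freq=1):
    # Hypothetical helper: --max-tokens is a per-GPU cap, so multiply
    # by GPU count and by any gradient-accumulation factor.
    return max_tokens * num_gpus * update_freq

print(effective_tokens_per_update(3072, 2))     # 6144
print(effective_tokens_per_update(3072, 8, 4))  # 98304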
Loading the model, the saved args have the arch as 'transformer_lm_gbw' and not 'transformer_lm_wiki103'. Why is this?
You can mostly ignore the "arch" value in the checkpoint, since the other configuration can be overridden elsewhere in the args. You should look at decoder_layers, decoder_embed_dim, ..., directly.
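For example, the released checkpoints are ordinary torch pickles, and older fairseq versions store the training args as a Namespace under the 'args' key (newer versions use 'cfg' instead), so you can inspect the real hyperparameters directly. A minimal sketch, with a placeholder checkpoint path:

import torch

ckpt = torch.load('model.pt', map_location='cpu')  # placeholder path
args = ckpt['args']  # argparse Namespace in older fairseq checkpoints
print(args.arch)               # e.g. 'transformer_lm_gbw'
print(args.decoder_layers)     # the values that actually define the model
print(args.decoder_embed_dim)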
Thanks, that helped a lot!