Fairseq: Batch size of wiki103 model

Created on 2 Mar 2020 · 2 comments · Source: pytorch/fairseq

โ“ Questions and Help

What is your question?

I have a few questions related to the Wiki103 pretrained model and the provided training script.

1) In the training script, you have

--max-tokens 3072 --tokens-per-sample 3072

However, in the paper you state that

For WIKITEXT-103 we partition the training data into blocks of 512 contiguous tokens

I'm wondering where/how this happens given the provided training example, or whether the training example simply does not match the paper. In general, I am confused about how batch size is determined in the fairseq framework. Running the code below with the provided wiki103 command-line args gives src_tokens with size [1, 3072].

2) For multiple GPUs, are --max-tokens and --tokens-per-sample per GPU, or do they get split across GPUs?

3) Loading the model, the saved args have the arch as 'transformer_lm_gbw' and not 'transformer_lm_wiki103'. Why is this?

Code

    from fairseq.tasks.language_modeling import LanguageModelingTask

    # Set up the language modeling task and load the requested split.
    reg_task = LanguageModelingTask.setup_task(args)
    reg_task.load_dataset(split)

    # Batches are sized by a token budget (max_tokens), not a fixed
    # number of sentences, so batch shapes can vary between iterations.
    reg_iter = reg_task.get_batch_iterator(
        reg_task.datasets[split],
        max_tokens=args.max_tokens,
        max_sentences=args.max_sentences,
        max_positions=args.max_target_positions,
    )
    reg_e_iter = reg_iter.next_epoch_itr(shuffle=True)

    for sample in reg_e_iter:
        print(sample, sample['id'].shape, 'id shape')
        print(sample['net_input']['src_tokens'].shape)

What's your environment?

  • fairseq Version (e.g., 1.0 or master): 0.9
  • PyTorch Version (e.g., 1.0): 1.4
  • OS (e.g., Linux): Linux
  • How you installed fairseq (pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.6
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: TitanX and others
  • Any other relevant information:
Label: question

All 2 comments

(...) However, in the paper you state that (...)

See Section 5.1 of the paper: "Table 2 shows our result on WIKITEXT-103 where adaptive inputs achieve 18.7 perplexity. For this result only, we partition the training data into blocks of 3072 contiguous tokens instead of 512 tokens as for other experiments." I believe this is the model that was released.
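
For intuition, here is a minimal sketch of that block partitioning; this is not fairseq's actual implementation (in fairseq it is handled by TokenBlockDataset, with --tokens-per-sample as the block size), and the token stream below is just a stand-in for the concatenated corpus:

    # Hypothetical sketch: split a flat token stream into contiguous blocks.
    def partition_into_blocks(stream, tokens_per_sample=3072):
        return [
            stream[i:i + tokens_per_sample]
            for i in range(0, len(stream), tokens_per_sample)
        ]

    blocks = partition_into_blocks(list(range(10000)))
    print([len(b) for b in blocks])  # [3072, 3072, 3072, 784]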

For multiple GPUs, are --max-tokens and --tokens-per-sample per GPU, or do they get split across GPUs?

--max-tokens and --tokens-per-sample are per GPU. So if you have two GPUs then you'll effectively have double the max tokens.
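
As a back-of-the-envelope check (assuming no gradient accumulation; --update-freq would multiply in as well):

    # Illustrative arithmetic; variable names mirror the command-line flags.
    max_tokens = 3072    # per-GPU token budget for one batch
    num_gpus = 2
    update_freq = 1      # gradient accumulation steps (--update-freq)

    tokens_per_update = max_tokens * num_gpus * update_freq
    print(tokens_per_update)  # 6144 tokens per optimizer step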

Loading the model, the saved args have the arch as 'transformer_lm_gbw' and not 'transformer_lm_wiki103'. Why is this?

You can mostly ignore the "arch" value in the checkpoint, since the rest of the configuration can be overridden elsewhere in the args. Look at decoder_layers, decoder_embed_dim, etc., directly.
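
For example, a minimal sketch for inspecting those fields, assuming a fairseq 0.9-style checkpoint where torch.load returns a dict with an 'args' namespace ('model.pt' is a placeholder path):

    import torch

    # Load on CPU and read the saved argparse Namespace.
    state = torch.load('model.pt', map_location='cpu')
    saved_args = state['args']

    print(saved_args.arch)               # e.g. 'transformer_lm_gbw'
    print(saved_args.decoder_layers)     # actual decoder depth
    print(saved_args.decoder_embed_dim)  # actual embedding dimension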

Thanks, that helped a lot!
