I have a few questions related to the Wiki103 pretrained model and the provided training script.
1) In the training script you have
--max-tokens 3072 --tokens-per-sample 3072
However in the paper, you state that
For WIKITEXT-103 we partition the training data into blocks of 512 contiguous tokens
I'm wondering where/how this happens in the provided training example, or whether the training example doesn't match the paper. In general, I am confused about how batch size is determined in the fairseq framework. Running the code below with the provided wiki103 command-line args gives src_tokens with size [1, 3072].
2) For multiple GPUs, are --max-tokens and --tokens-per-sample per GPU, or do they get split across GPUs?
3) Loading the model, the saved args have the arch as 'transformer_lm_gbw' and not 'transformer_lm_wiki103'. Why is this?
from fairseq.tasks.language_modeling import LanguageModelingTask

# Build the LM task and load the requested split.
reg_task = LanguageModelingTask.setup_task(args)
reg_task.load_dataset(split)

# Batches are sized by token count, not by sentence count.
reg_iter = reg_task.get_batch_iterator(
    reg_task.datasets[split],
    max_tokens=args.max_tokens,
    max_sentences=args.max_sentences,
    max_positions=args.max_target_positions,
)
reg_e_iter = reg_iter.next_epoch_itr(shuffle=True)
for sample in reg_e_iter:
    print(sample, sample['id'].shape, 'id shape')
    print(sample['net_input']['src_tokens'].shape)  # [1, 3072] with the wiki103 args
However in the paper, you state that (...)
See Section 5.1 of the paper: "Table 2 shows our result on WIKITEXT-103 where adaptive inputs achieve 18.7 perplexity. For this result only, we partition the training data into blocks of 3072 contiguous tokens instead of 512 tokens as for other experiments." I believe this is the model that was released.
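For intuition, it's --tokens-per-sample that sets this block length: the training corpus is concatenated into one long token stream and cut into contiguous fixed-size chunks (in fairseq, TokenBlockDataset does the real work). Here is a minimal sketch of that partitioning; chunk_into_blocks is a hypothetical helper for illustration, not fairseq's actual code.

import torch

def chunk_into_blocks(token_stream, tokens_per_sample):
    # Hypothetical illustration of what --tokens-per-sample controls;
    # drops the ragged tail that doesn't fill a whole block.
    n_blocks = len(token_stream) // tokens_per_sample
    return token_stream[:n_blocks * tokens_per_sample].view(n_blocks, tokens_per_sample)

stream = torch.randint(0, 50000, (10_000_000,))  # stand-in token stream
blocks = chunk_into_blocks(stream, 3072)
print(blocks.shape)  # torch.Size([3255, 3072])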
For multiple GPUs, are --max-tokens and --tokens-per-sample per GPU, or do they get split across GPUs?
--max-tokens and --tokens-per-sample are per GPU. So if you have two GPUs then you'll effectively have double the max tokens.
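So the effective number of tokens per optimizer update scales with the GPU count, and also with --update-freq if you use gradient accumulation. A quick back-of-the-envelope sketch (the helper name is made up, assuming those are the only multiplying factors):

def effective_tokens_per_update(max_tokens, num_gpus, update_freq=1):
    # Hypothetical helper: --max-tokens is a per-GPU cap, so multiply
    # by GPU count and by any gradient-accumulation factor.
    return max_tokens * num_gpus * update_freq

print(effective_tokens_per_update(3072, 2))     # 6144
print(effective_tokens_per_update(3072, 8, 4))  # 98304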
Loading the model, the saved args have the arch as 'transformer_lm_gbw' and not 'transformer_lm_wiki103'. Why is this?
You can mostly ignore the "arch" value in the checkpoint, since the other configuration can be overridden elsewhere in the args. You should look at decoder_layers, decoder_embed_dim, ..., directly.
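For example, the released checkpoints are ordinary torch pickles, and older fairseq versions store the training args as a Namespace under the 'args' key (newer versions use 'cfg' instead), so you can inspect the real hyperparameters directly. A minimal sketch, with a placeholder checkpoint path:

import torch

ckpt = torch.load('model.pt', map_location='cpu')  # placeholder path
args = ckpt['args']  # argparse Namespace in older fairseq checkpoints
print(args.arch)               # e.g. 'transformer_lm_gbw'
print(args.decoder_layers)     # the values that actually define the model
print(args.decoder_embed_dim)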
Thanks, that helped a lot!