Fairseq: Finetuning BART on CNN/DM: Why not truncate long articles?

Created on 22 Nov 2019  ·  6 comments  ·  Source: pytorch/fairseq

In the README for finetuning BART on CNN/DM, the command for finetuning BART does not specify
--truncate-source.

By running the README command, I get the following warning:

WARNING: 86040 samples have invalid sizes and will be skipped [...]

But if I specify the option --truncate-source, the warning becomes:

WARNING: 5 samples have invalid sizes and will be skipped [...]


Why is the option --truncate-source not used?
It feels like skipping 80k samples is detrimental to performance, and other existing architectures always truncate articles that are too long.
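To make the difference concrete, here is a minimal sketch (not fairseq code; the function names are made up for illustration) contrasting the two behaviors the warnings describe: the default skips any sample longer than the model's 1024-token limit, while --truncate-source clips it instead.

```python
# Sketch of filtering vs. truncating over-long samples,
# assuming a max source length of 1024 tokens (BART's limit).
MAX_LEN = 1024

def filter_long(examples, max_len=MAX_LEN):
    # Default behavior: drop every example longer than max_len.
    return [ex for ex in examples if len(ex) <= max_len]

def truncate_long(examples, max_len=MAX_LEN):
    # --truncate-source behavior: keep every example, clipping its tokens.
    return [ex[:max_len] for ex in examples]

examples = [list(range(500)), list(range(2000)), list(range(1500))]
print(len(filter_long(examples)))    # only the short article survives
print(len(truncate_long(examples)))  # all three articles are kept
```

With CNN/DM's long articles, filtering discards a large fraction of the training set, which is exactly the 86040-sample warning above.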

@ngoyal2707 @yinhanliu

Most helpful comment

@astariul thanks for pointing this out. Apparently Naman changed the code when he pushed it, compared to what I had when I fine-tuned on CNN/DM, but he is on vacation now, so I can only describe what I did to get the numbers in the paper. I truncated every source to 1024 - 4 tokens, then formatted it as BOS truncated_source EOS MASK EOS. When we prepared the code for release, Naman tried removing MASK and EOS, and it turned out to make no difference. Either way, we read all the instances from the data and truncated the longer ones to fit into 1024 tokens; the numbers in the paper used all the instances, without filtering. I should have time later next week to look at this code and see where truncation happens without --truncate-source. Most likely we simply forgot to set --truncate-source to True by default.
Thanks for pointing this out to me.
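The input construction described in the comment can be sketched as follows. This is a hypothetical reconstruction, not the actual fairseq code: the special-token ids here are made up, and only the shape of the sequence (truncate to 1024 - 4 tokens, then wrap as BOS source EOS MASK EOS) comes from the comment.

```python
# Assumed special-token ids for illustration only.
BOS, EOS, MASK = 0, 2, 51200
MAX_POSITIONS = 1024  # BART's max source length

def build_input(source_tokens):
    # Truncate to 1024 - 4, leaving room for the four special tokens,
    # then wrap as: BOS truncated_source EOS MASK EOS.
    truncated = source_tokens[: MAX_POSITIONS - 4]
    return [BOS] + truncated + [EOS, MASK, EOS]

seq = build_input(list(range(100, 3100)))  # a 3000-token "article"
print(len(seq))  # 1024: the result never exceeds the position limit
```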

All 6 comments

Thanks for the answer @yinhanliu

The code you link appears to be inside the scope of this if:
https://github.com/pytorch/fairseq/blob/5349052aae4ec1350822c894fbb6be350dff61a0/fairseq/tasks/translation.py#L55-L62

So if the option --truncate-source is not specified, we never enter this if, right?
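The linked logic can be paraphrased as the sketch below. This is an assumed simplification, not the actual fairseq implementation: the dataset is only wrapped in a truncating view when --truncate-source is set; otherwise it is left untouched, and over-long samples are later skipped by the size filter, producing the warning above.

```python
# Rough paraphrase of the conditional in translation.py (assumed, simplified).
class TruncatingView:
    """Wraps a dataset so each source is clipped to max_len tokens."""
    def __init__(self, dataset, max_len):
        self.dataset, self.max_len = dataset, max_len
    def __getitem__(self, i):
        return self.dataset[i][: self.max_len]
    def __len__(self):
        return len(self.dataset)

def load_source(dataset, truncate_source, max_source_positions=1024):
    if truncate_source:  # the `if` block from the linked code
        return TruncatingView(dataset, max_source_positions)
    return dataset       # untouched: long samples get filtered later
```

This would explain the observed behavior: without the flag, no truncation ever happens, so the size filter drops 86040 samples.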

(See the comment quoted above as "Most helpful comment".)

closing after Yinhan's answer

I don't think it should be closed yet: we are still waiting for confirmation from @ngoyal2707.

@Colanim Yes, we truncated the long sources; in fact, that's the very reason I introduced the --truncate-source arg in the code.
It's just a missing arg in the README; I will add the fix. Thanks for pointing it out @Colanim
