In the README for finetuning BART on CNN/DM, the finetuning command does not specify --truncate-source.
By running the README command, I get the following warning:
WARNING: 86040 samples have invalid sizes and will be skipped [...]
But if I specify the option --truncate-source, the warning becomes:
WARNING: 5 samples have invalid sizes and will be skipped [...]
Why is the option --truncate-source not used?
It feels like skipping 80k samples would be detrimental to performance, and other existing architectures always truncate articles that are too long.
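For context, the skipped samples come from a size filter along these lines (a minimal sketch with illustrative names, assuming max_positions=1024 for BART; this is not fairseq's actual code):

```python
# Hypothetical sketch of the size filtering that produces the warning above.
# Samples longer than the model's max positions are dropped instead of truncated.
def filter_by_size(indices, sizes, max_positions=1024):
    kept, ignored = [], []
    for idx in indices:
        (kept if sizes[idx] <= max_positions else ignored).append(idx)
    if ignored:
        print(f"WARNING: {len(ignored)} samples have invalid sizes and will be skipped")
    return kept
```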
@ngoyal2707 @yinhanliu
It is truncated in the code, so they should give the same result.
Thanks for the answer @yinhanliu
The code you linked appears to be inside the scope of this if:
https://github.com/pytorch/fairseq/blob/5349052aae4ec1350822c894fbb6be350dff61a0/fairseq/tasks/translation.py#L55-L62
So if the option --truncate-source is not specified, we never enter this if, right?
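For reference, the guarded block looks roughly like this (paraphrased from the permalink above, not verbatim; it only runs when --truncate-source is set):

```python
from fairseq.data import AppendTokenDataset, StripTokenDataset, TruncateDataset

if truncate_source:
    # Strip the trailing EOS, cap the source at max_source_positions - 1
    # tokens, then re-append EOS so every source still ends with EOS.
    src_dataset = AppendTokenDataset(
        TruncateDataset(
            StripTokenDataset(src_dataset, src_dict.eos()),
            max_source_positions - 1,
        ),
        src_dict.eos(),
    )
```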
@astariul thanks for pointing this out. Apparently Naman changed the code when he pushed it, relative to what I had when I fine-tuned CNN/DM. But Naman is on vacation now, so I can only describe what I did to get the numbers in the paper. I truncated every source to 1024 - 4 tokens, then formatted it as BOS truncated_source EOS MASK EOS. When we prepared the code for release, Naman tried removing MASK and EOS, and it turned out to make no difference. Either way, we read all the instances from the data and truncated the longer ones to fit into 1024. The numbers in the paper used all the instances, without any filtering. I will probably have time later next week to look at this code and see where the truncation happens without --truncate-source. More likely, we forgot to set --truncate-source to True by default.
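In code, that formatting is roughly the following (a sketch of what is described above; the helper name and token-id arguments are placeholders, not the actual implementation):

```python
# Sketch of the formatting described above: truncate to 1024 - 4 tokens,
# then wrap as BOS truncated_source EOS MASK EOS.
def format_source(tokens, bos, eos, mask, max_positions=1024):
    truncated = tokens[: max_positions - 4]  # reserve 4 slots for special tokens
    return [bos] + truncated + [eos, mask, eos]
```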
Thanks for pointing this out to me.
closing after Yinhan's answer
I don't think it should be closed yet: we are still waiting for confirmation from @ngoyal2707.
@Colanim Yes, we truncated the long sources; in fact, that's the very reason I introduced the --truncate-source arg in the code.
Just a missing arg in the README; will add the fix. Thanks for pointing it out, @Colanim.