Model: bart-large-cnn and t5-base
Language: English
The problem arises when using: this Colab notebook, running both BART and T5 with the summarization pipeline.
Dataset: CNN/DM
Run the notebook and measure inference time for the two models. On my run, I got:
BART = 73s
T5 = 369s
I expected T5 to be at least as fast as BART, since it has fewer parameters (for the base version, at least). Instead, T5 takes much longer...
@patrickvonplaten Do you happen to know why T5 is so slow?
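For reference, the comparison boils down to something like this (a minimal sketch, not the exact notebook code; the article text and the min_length/max_length values are placeholders):

```python
import time
from transformers import pipeline

# Placeholder input; the notebook loops over CNN/DM articles.
article = "New York (CNN) The city recorded its warmest winter on record this year ..."

for model_name in ["facebook/bart-large-cnn", "t5-base"]:
    summarizer = pipeline("summarization", model=model_name, tokenizer=model_name)
    start = time.time()
    summarizer(article, min_length=56, max_length=142)  # placeholder length settings
    print(f"{model_name}: {time.time() - start:.1f}s")
```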
Hi @Colanim, thanks a lot for your speed comparison :-).
It might be possible that the pipelines use different default parameters for T5 and Bart under the hood, which strongly influence their running times.
Besides min_length and max_length, could you also pass the following parameters to both T5 and Bart to overwrite the defaults (see the sketch after the list):
"early_stopping": True
"length_penalty": 2.0
"no_repeat_ngram_size": 3
"num_beams": 4
If there is still a big difference in time, then I guess we have to take a closer look!
Thanks for your fast answer @patrickvonplaten
Here is the link to the modified notebook, with the parameters you mentioned:
https://colab.research.google.com/drive/1kCm5ew8qDQqguZjbsC6Ujs9KZBaSfafi
Unfortunately, there is still a huge difference...
BART = 66s
T5 = 226s
Ok, good to know! Thanks for doing the comparison, @Colanim. This might interest you as well @sshleifer :-)
Oh, actually I just remembered that Bart caches the decoder key/value states when doing auto-regressive decoding (similar to GPT-2 - check the visuals under "GPT-2 Masked Self-Attention" in this post), and I think T5 does not.
But T5 could cache the decoder key/value states to speed up decoding as well, since it uses a causal mask for the decoder. This could definitely be a feature request. What do you think
@sshleifer @craffel @thomwolf ?
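For intuition, here is a rough way to see what key/value caching buys, using GPT-2 (whose caching is already wired up) rather than T5; the prompt and max_length are arbitrary, and exact timings will of course vary:

```python
import time
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("The cached key/value states", return_tensors="pt")

with torch.no_grad():
    # With caching: each step attends with only the newest token's query,
    # re-using the stored keys/values of all previous tokens.
    start = time.time()
    model.generate(input_ids, max_length=256, use_cache=True)
    print(f"with cache:    {time.time() - start:.1f}s")

    # Without caching: every step re-computes attention over the whole prefix.
    start = time.time()
    model.generate(input_ids, max_length=256, use_cache=False)
    print(f"without cache: {time.time() - start:.1f}s")
```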
Sounds worth it!