Transformers: 🐛 Summarization pipeline: T5-base much slower than BART-large

Created on 3 Apr 2020 · 4 comments · Source: huggingface/transformers

🐛 Bug

Information

Model: bart-large-cnn and t5-base

Language: English

The problem arises when using: this Colab notebook, running both BART and T5 through the summarization pipeline.

Dataset: CNN/DM

To reproduce

Run the notebook and measure the inference time of the two models. On my run, I get:

BART = 73s
T5 = 369s
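
For reference, the comparison boils down to timing the same pipeline call for both checkpoints. A minimal sketch, assuming the CNN/DM articles are already loaded into a list called `articles` (that variable and the min_length/max_length values are placeholders, not taken from the notebook):

    import time
    from transformers import pipeline

    # Model identifiers as named in the issue.
    bart = pipeline("summarization", model="bart-large-cnn")
    t5 = pipeline("summarization", model="t5-base")

    def time_summarizer(summarizer, texts):
        # Total wall-clock time to summarize every text with the same settings.
        start = time.time()
        for text in texts:
            summarizer(text, min_length=56, max_length=142)  # placeholder lengths
        return time.time() - start

    print("BART:", time_summarizer(bart, articles), "s")
    print("T5:  ", time_summarizer(t5, articles), "s")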

Expected behavior

I expected T5 to be at least as fast as BART, since it has fewer parameters (for the base version at least). Instead, T5 takes much longer...

@patrickvonplaten Do you happen to know why T5 is so slow?

Pipeline


All 4 comments

Hi @Colanim, thanks a lot for your speed comparison :-).

It might be that the pipeline uses different default parameters for T5 and Bart under the hood, which strongly influence their running times.
Besides min_length and max_length, could you also pass these parameters to both T5 and Bart to override the defaults (see the sketch after the list):

      "early_stopping": True
      "length_penalty": 2.0
      "no_repeat_ngram_size": 3
      "num_beams": 4

If there is still a big difference in time, then I guess we have to take a closer look!

Thanks for your quick answer, @patrickvonplaten!

Here is the link to the modified notebook, with the parameters you mentioned:
https://colab.research.google.com/drive/1kCm5ew8qDQqguZjbsC6Ujs9KZBaSfafi


Unfortunately, there is still a huge difference...

BART = 66s
T5 = 226s

OK, good to know! Thanks for doing the comparison, @Colanim. This might interest you as well, @sshleifer :-)

Oh, actually I just remembered that Bart caches the decoder key/value states when doing auto-regressive decoding (similar to GPT-2; see the visuals under "GPT-2 Masked Self-Attention" in this post), and I think T5 does not.

But T5 could cache the decoder key/value states to speed up decoding as well, since it uses a causal mask in the decoder. This could definitely be a feature request. What do you think,
@sshleifer @craffel @thomwolf?
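
To make the caching idea concrete, here is a library-agnostic sketch of what caching decoder key/value states buys during greedy decoding; it is a simplified single-head attention in plain PyTorch, not the actual Bart/T5 code:

    import torch

    def attend(q, k, v):
        # q: (1, d); k, v: (t, d). Attend from the newest token over all cached positions.
        scores = q @ k.T / (k.shape[-1] ** 0.5)
        return torch.softmax(scores, dim=-1) @ v

    def decode_step(new_hidden, cache, w_q, w_k, w_v):
        # Project only the newest token; keys/values of earlier tokens are reused
        # from the cache instead of being recomputed at every step.
        q = new_hidden @ w_q
        k_new, v_new = new_hidden @ w_k, new_hidden @ w_v
        cache["k"] = k_new if cache["k"] is None else torch.cat([cache["k"], k_new])
        cache["v"] = v_new if cache["v"] is None else torch.cat([cache["v"], v_new])
        return attend(q, cache["k"], cache["v"]), cache

Without such a cache, every decoding step re-projects and re-attends over the full prefix, which matches the hypothesis above about where T5's extra cost comes from.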

Sounds worth it!

