Model: bart-large-cnn and t5-base
Language: English
The problem arises when using: this Colab notebook, running both BART and T5 with the summarization pipeline.
Dataset: CNN/DM
Run the notebook and measure inference time for the two models. On my run, I got:
BART = 73s
T5 = 369s
I expected T5 to be at least as fast as BART, since it has fewer parameters (for the base version, at least). Instead, T5 takes much longer...
@patrickvonplaten Do you happen to know why T5 is so slow?
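For reference, the comparison boils down to something like this (a minimal sketch, not the exact notebook code; the article text and the min_length/max_length values are placeholders):

```python
import time
from transformers import pipeline

# Placeholder input; the notebook loops over CNN/DM articles.
article = "New York (CNN) The city recorded its warmest winter on record this year ..."

for model_name in ["facebook/bart-large-cnn", "t5-base"]:
    summarizer = pipeline("summarization", model=model_name, tokenizer=model_name)
    start = time.time()
    summarizer(article, min_length=56, max_length=142)  # placeholder length settings
    print(f"{model_name}: {time.time() - start:.1f}s")
```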
Hi @Colanim, thanks a lot for your speed comparison :-).
It might be possible that the pipelines use different default parameters for T5 and Bart under the hood, which strongly influence their running times.
Besides min_length and max_length, could you also pass the following parameters to both T5 and Bart to overwrite the defaults (see the sketch after the list):
"early_stopping": True
"length_penalty": 2.0
"no_repeat_ngram_size": 3
"num_beams": 4
If there is still a big difference in time, then I guess we have to take a closer look!
Thanks for your fast answer @patrickvonplaten
Here is the link to the modified notebook, with the parameters you mentioned:
https://colab.research.google.com/drive/1kCm5ew8qDQqguZjbsC6Ujs9KZBaSfafi
Unfortunately, there is still a huge difference...
BART = 66s
T5 = 226s
Ok, good to know! Thanks for doing the comparison, @Colanim. This might interest you as well @sshleifer :-)
Oh, actually I just remembered that Bart caches the decoder key/value states when doing auto-regressive decoding (similar to GPT-2 - check the visuals under "GPT-2 Masked Self-Attention" in this post), and I think T5 does not.
But T5 could cache the decoder key/value states to speed up decoding as well, since it uses a causal mask for the decoder. This could definitely be a feature request. What do you think
@sshleifer @craffel @thomwolf ?
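For intuition, here is a rough way to see what key/value caching buys, using GPT-2 (whose caching is already wired up) rather than T5; the prompt and max_length are arbitrary, and exact timings will of course vary:

```python
import time
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("The cached key/value states", return_tensors="pt")

with torch.no_grad():
    # With caching: each step attends with only the newest token's query,
    # re-using the stored keys/values of all previous tokens.
    start = time.time()
    model.generate(input_ids, max_length=256, use_cache=True)
    print(f"with cache:    {time.time() - start:.1f}s")

    # Without caching: every step re-computes attention over the whole prefix.
    start = time.time()
    model.generate(input_ids, max_length=256, use_cache=False)
    print(f"without cache: {time.time() - start:.1f}s")
```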
Sounds worth it!