This is an incredible project from the awesome https://github.com/allenai team that solves a big problem in transformers.
From https://twitter.com/i_beltagy/status/1249750021811011591
Excited to share our work on Longformer, a scalable transformer model for long-document NLP tasks without chunking/truncation to fit the 512 limit.
Work with @mattthemathman, @armancohan
Code and pretrained model: http://github.com/allenai/longformer
We replace the standard self-attention with one that scales linearly with sequence length and that can flexibly adapt to downstream tasks. We continue pretraining from the RoBERTa checkpoint and evaluate on QA, coref, classification. Pretrained model supports seqlen 4,096
The small model achieves SOTA results on enwik8 and text8 and the large model gets close with half the parameters. Longformer's self-attention uses an efficient CUDA kernel that minimizes memory usage (char-lm large model, 23k tokens at training and 32k tokens at evaluation)
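For reference, a minimal sketch of loading the released 4,096-token checkpoint. This assumes the Hugging Face `transformers` integration that this issue tracks (class names and the `global_attention_mask` argument are from recent `transformers` versions and may differ from the standalone allenai/longformer repo):

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# A long document; anything up to 4,096 tokens fits without chunking/truncation.
text = " ".join(["Long documents need long-range context."] * 400)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

# Sliding-window (local) attention everywhere, plus global attention on the
# first token, as is typical for classification-style tasks.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)

print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, seq_len, 768])
```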
[X] the model implementation is available: https://github.com/allenai/longformer
[X] the model weights are available: yes, at https://github.com/allenai/longformer
[X] who are the authors: @ibeltagy @schmmd
Any updates on this? Just curious.
Reformer will be added next week and then work will start on Longformer :-)
Look forward to it!
Longformer is added now - closing!
@patrickvonplaten I have been using Longformer self attention with LongBart for summarisation recently and have done some side-by-side comparisons to hf BartForConditionalGeneration. I noticed that LongBart is actually using more memory than hf BartForConditionalGeneration (when they're set up equivalently). I looked into this and found that it is coming from the self attention layer, i.e. Longformer self attention is using more memory than the normal multi-head self attention in BartForConditionalGeneration.
Wondering if this is expected or a bug? If it's expected, could you please explain? I thought the point of Longformer self attention was to reduce memory consumption...
It depends very much on the sequence length of your input. Did you benchmark your results using the benchmarking utils?
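For anyone reproducing this, a rough sketch of what using the benchmarking utils could look like. This uses the `PyTorchBenchmark`/`PyTorchBenchmarkArguments` utilities in `transformers`; argument names have changed across versions, so treat it as indicative rather than exact:

```python
from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

# Measure inference speed/memory for the Longformer checkpoint at several
# sequence lengths, so the crossover point versus a regular model is visible.
args = PyTorchBenchmarkArguments(
    models=["allenai/longformer-base-4096"],
    batch_sizes=[1],
    sequence_lengths=[512, 1024, 2048, 4096],
)
benchmark = PyTorchBenchmark(args)
results = benchmark.run()
```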
@alexgaskell10, what is the sequence length? If the sequence length is shorter than the window size (for LongBart, it is probably 1024), you will see a bit of an increase in memory. For sequences longer than the window size (say, 2048), LongformerSelfAttention should be much more memory efficient compared to regular self-attention.
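A back-of-the-envelope illustration of that point (not the actual CUDA kernel): regular self-attention materializes an n × n score matrix per head, while a sliding window of size w needs roughly n × w, so the saving only shows up once the sequence is longer than the window:

```python
# Rough score-matrix sizes only; ignores the implementation overhead discussed below.
def attention_score_numel(seq_len, window):
    full = seq_len * seq_len                    # regular self-attention: n x n
    windowed = seq_len * min(window, seq_len)   # sliding-window attention: n x w
    return full, windowed

for n in (512, 1024, 2048, 4096):
    full, windowed = attention_score_numel(n, window=1024)
    print(f"seq_len={n}: full={full:,} windowed={windowed:,} ratio={full / windowed:.1f}x")
```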
Thanks to both for the quick responses. I have only tried with input lengths <= 1024 but nothing beyond that. Makes sense that the benefits of Longformer self attention are more evident as sequences get longer, thanks.
@patrickvonplaten no I didn't know there was a script for this already, I just used something I wrote. I'll have a look at this.
@ibeltagy I have set the sequence length equal to the window size (and tried several different values, all <= 1024). I thought that if I used a sequence length of 1024 and a window size of 1024 then the Longformer and multi-head self attention layers would be equivalent (thereby making LongBart and BartForConditionalGeneration equivalent). Is there some overhead to using Longformer self attention which means it is more costly for sequences <= 1024?
> equivalent

They are not perfectly equivalent, but close.
> which means it is more costly for sequences <= 1024?

Yes, the current implementation has a bit of overhead with sequences shorter than the window length. We are planning to address that in the future. One way to do so is to switch to regular self-attention if the sequence is short, but this probably requires additional pretraining to teach the model to work with both types of self-attention.
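A hypothetical sketch of that fallback idea (the function names and signature below are made up for illustration; this is not the actual Longformer implementation):

```python
def hybrid_self_attention(hidden_states, full_attn, windowed_attn, window_size):
    """Hypothetical dispatcher: hidden_states is a (batch, seq_len, hidden) tensor."""
    seq_len = hidden_states.size(1)
    if seq_len <= window_size:
        # Short input: every token already sees the whole sequence, so plain
        # O(n^2) attention avoids the windowing overhead mentioned above.
        return full_attn(hidden_states)
    # Long input: use the linear-memory sliding-window attention.
    return windowed_attn(hidden_states)
```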
Great, all makes sense. I'll run benchmarking for longer sequences and flag if anything unusual shows up. Thanks!