This is an incredible project from the awesome https://github.com/allenai team that solves a big problem in transformers.
From https://twitter.com/i_beltagy/status/1249750021811011591
Excited to share our work on Longformer, a scalable transformer model for long-document NLP tasks without chunking/truncation to fit the 512 limit.
Work with @mattthemathman, @armancohan
Code and pretrained model: http://github.com/allenai/longformer
We replace the standard self-attention with one that scales linearly with sequence length and that can flexibly adapt to downstream tasks. We continue pretraining from the RoBERTa checkpoint and evaluate on QA, coref, classification. Pretrained model supports seqlen 4,096
The small model achieves SOTA results on enwik8 and text8 and the large model gets close with half the parameters. Longformer's self-attention uses an efficient CUDA kernel that minimizes memory usage (char-lm large model, 23k tokens at training and 32k tokens at evaluation)
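For reference, a minimal sketch of loading the released 4,096-token checkpoint. This assumes the Hugging Face `transformers` integration that this issue tracks (class names and the `global_attention_mask` argument are from recent `transformers` versions and may differ from the standalone allenai/longformer repo):

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# A long document; anything up to 4,096 tokens fits without chunking/truncation.
text = " ".join(["Long documents need long-range context."] * 400)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

# Sliding-window (local) attention everywhere, plus global attention on the
# first token, as is typical for classification-style tasks.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)

print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, seq_len, 768])
```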
[X] the model implementation is available: https://github.com/allenai/longformer
[X] the model weights are available: yes, at https://github.com/allenai/longformer
[X] who are the authors: @ibeltagy @schmmd
Any updates on this? Just curious.
Reformer will be added next week and then work will start on Longformer :-)
Look forward to it!
Longformer is added now - closing!
@patrickvonplaten I have been using Longformer self attention with LongBart for summarisation recently and have done some side-by-side comparisons to hf BartForConditionalGeneration. I noticed that LongBart is actually using more memory than hf BartForConditionalGeneration (when they're set up equivalently). I looked into this and found that it is coming from the self attention layer, i.e. Longformer self attention is using more memory than the normal multi-head self attention in BartForConditionalGeneration.
Wondering if this is expected or a bug? If it's expected, could you please explain? I thought the point of Longformer self attention was to reduce memory consumption...
It depends very much on the sequence length of your input. Did you benchmark your results using the benchmarking utils?
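For anyone reproducing this, a rough sketch of what using the benchmarking utils could look like. This uses the `PyTorchBenchmark`/`PyTorchBenchmarkArguments` utilities in `transformers`; argument names have changed across versions, so treat it as indicative rather than exact:

```python
from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

# Measure inference speed/memory for the Longformer checkpoint at several
# sequence lengths, so the crossover point versus a regular model is visible.
args = PyTorchBenchmarkArguments(
    models=["allenai/longformer-base-4096"],
    batch_sizes=[1],
    sequence_lengths=[512, 1024, 2048, 4096],
)
benchmark = PyTorchBenchmark(args)
results = benchmark.run()
```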
@alexgaskell10, what is the sequence length? If the sequence length is shorter than the window size (for LongBart, it is probably 1024), you will see a bit of an increase in memory. For sequences longer than the window size (say, 2048), LongformerSelfAttention should be much more memory efficient compared to regular self-attention.
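A back-of-the-envelope illustration of that point (not the actual CUDA kernel): regular self-attention materializes an n × n score matrix per head, while a sliding window of size w needs roughly n × w, so the saving only shows up once the sequence is longer than the window:

```python
# Rough score-matrix sizes only; ignores the implementation overhead discussed below.
def attention_score_numel(seq_len, window):
    full = seq_len * seq_len                    # regular self-attention: n x n
    windowed = seq_len * min(window, seq_len)   # sliding-window attention: n x w
    return full, windowed

for n in (512, 1024, 2048, 4096):
    full, windowed = attention_score_numel(n, window=1024)
    print(f"seq_len={n}: full={full:,} windowed={windowed:,} ratio={full / windowed:.1f}x")
```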
Thanks to both for the quick responses. I have only tried with input lengths <= 1024 but nothing beyond that. Makes sense that the benefits of Longformer self attention are more evident as sequences get longer, thanks.
@patrickvonplaten no I didn't know there was a script for this already, I just used something I wrote. I'll have a look at this.
@ibeltagy I have set the sequence length equal to the window size (and tried several different values, all <= 1024). I thought that if I used a sequence length of 1024 and a window size of 1024 then the Longformer and multi-head self attention layers would be equivalent (thereby making LongBart and BartForConditionalGeneration equivalent). Is there some overhead to using Longformer self attention which means it is more costly for sequences <= 1024?
> equivalent

They are not perfectly equivalent, but close.
> which means it is more costly for sequences <= 1024?

Yes, the current implementation has a bit of overhead with sequences shorter than the window length. We are planning to address that in the future. One way to do so is to switch to regular self-attention if the sequence is short, but this probably requires additional pretraining to teach the model to work with both types of self-attention.
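A hypothetical sketch of that fallback idea (the function names and signature below are made up for illustration; this is not the actual Longformer implementation):

```python
def hybrid_self_attention(hidden_states, full_attn, windowed_attn, window_size):
    """Hypothetical dispatcher: hidden_states is a (batch, seq_len, hidden) tensor."""
    seq_len = hidden_states.size(1)
    if seq_len <= window_size:
        # Short input: every token already sees the whole sequence, so plain
        # O(n^2) attention avoids the windowing overhead mentioned above.
        return full_attn(hidden_states)
    # Long input: use the linear-memory sliding-window attention.
    return windowed_attn(hidden_states)
```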
Great, all makes sense. I'll run benchmarking for longer sequences and flag if anything unusual shows up. Thanks!