Transformers: Longformer, a scalable transformer model for long-document NLP tasks

Created on 14 Apr 2020 · 10 comments · Source: huggingface/transformers

🌟 New model addition

Model description

This is an incredible project from the awesome https://github.com/allenai team that solves a big problem in transformers.

From https://twitter.com/i_beltagy/status/1249750021811011591

Excited to share our work on Longformer, a scalable transformer model for long-document NLP tasks without chunking/truncation to fit the 512 limit.
Work with @mattthemathman, @armancohan

Code and pretrained model: http://github.com/allenai/longformer

We replace the standard self-attention with one that scales linearly with sequence length and that can flexibly adapt to downstream tasks. We continue pretraining from the RoBERTa checkpoint and evaluate on QA, coreference resolution, and classification. The pretrained model supports sequence lengths up to 4,096.

The small model achieves SOTA results on enwik8 and text8, and the large model gets close with half the parameters. Longformer's self-attention uses an efficient CUDA kernel that minimizes memory usage (char-LM large model: 23k tokens at training and 32k tokens at evaluation).
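A minimal sketch of loading the released checkpoint through the transformers library (assuming the allenai/longformer-base-4096 checkpoint name published by AllenAI and the LongformerModel/LongformerTokenizer classes; exact output types may differ across library versions):

```python
# Sketch: encode a long document with the released 4,096-token checkpoint.
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# A document far longer than the usual 512-token limit.
text = " ".join(["Long documents do not fit in a 512-token window."] * 300)
inputs = tokenizer(text, max_length=4096, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

last_hidden_state = outputs[0]  # (batch, seq_len, hidden_size)
print(last_hidden_state.shape)
```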

Open source status

All 10 comments

Any updates on this? Just curious.

Reformer will be added next week and then work will start on Longformer :-)

Look forward to it!

Longformer is added now - closing!

@patrickvonplaten I have been using Longformer self-attention with LongBart for summarisation recently and have done some side-by-side comparisons with hf BartForConditionalGeneration. I noticed that LongBart is actually using more memory than hf BartForConditionalGeneration (when they're set up equivalently). I looked into this and found that the difference comes from the self-attention layer, i.e. Longformer self-attention is using more memory than the normal multi-head self-attention in BartForConditionalGeneration.

Wondering if this is expected or a bug? If it's expected, could you please explain? I thought the point of Longformer self attention was to reduce memory consumption...

It depends very much on the sequence length of your input. Did you benchmark your results using the benchmarking utils?
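If helpful, a minimal sketch of that benchmarking path (assuming transformers' PyTorchBenchmark / PyTorchBenchmarkArguments utilities; the model name and settings are illustrative, swap in your own configuration):

```python
# Sketch: compare inference speed and peak memory across sequence lengths.
from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

args = PyTorchBenchmarkArguments(
    models=["allenai/longformer-base-4096"],  # illustrative; use the model under test
    batch_sizes=[1],
    sequence_lengths=[512, 1024, 2048, 4096],
)
benchmark = PyTorchBenchmark(args)
results = benchmark.run()  # prints time and memory per (model, batch size, seq_len)
```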

@alexgaskell10, what is the sequence length? If the sequence length is shorter than the window size (for LongBart, it is probably 1024), you will see a bit of an increase in memory. For sequences longer than the window size (say, 2048), LongformerSelfAttention should be much more memory efficient compared to regular self-attention.
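As a back-of-the-envelope illustration of that point (rough counts only, not a measurement):

```python
# Full self-attention materializes roughly seq_len * seq_len attention scores per head,
# while sliding-window attention materializes roughly seq_len * window scores.
def approx_scores(seq_len, window=None):
    return seq_len * seq_len if window is None else seq_len * window

WINDOW = 1024  # the window size assumed above for LongBart
for seq_len in (1024, 2048, 4096):
    full = approx_scores(seq_len)
    windowed = approx_scores(seq_len, window=WINDOW)
    print(f"seq_len={seq_len}: full={full:,} vs windowed={windowed:,} ({full / windowed:.0f}x)")
# At seq_len == window the two are the same size (so windowing only adds overhead);
# the savings kick in once seq_len grows past the window.
```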

Thanks to both for the quick responses. I have only tried input lengths <= 1024, nothing beyond that. It makes sense that the benefits of Longformer self-attention become more evident as sequences get longer, thanks.

@patrickvonplaten no I didn't know there was a script for this already, I just used something I wrote. I'll have a look at this.

@ibeltagy I have set the sequence length equal to the window size (and tried several different values, all <= 1024). I thought that if I used a sequence length of 1024 and a window size of 1024, then the Longformer and multi-head self-attention layers would be equivalent (thereby making LongBart and BartForConditionalGeneration equivalent). Is there some overhead to using Longformer self-attention which means it is more costly for sequences <= 1024?

> equivalent

They are not perfectly equivalent, but close.

> which means it is more costly for sequences <= 1024?

Yes, the current implementation has some overhead for sequences shorter than the window length. We are planning to address that in the future. One way to do so is to switch to regular self-attention when the sequence is short, but that probably requires additional pretraining to teach the model to work with both types of self-attention.
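Sketched out, that workaround could look something like the following (a purely hypothetical wrapper; the class and argument names are illustrative, not part of the library):

```python
import torch.nn as nn

class HybridSelfAttention(nn.Module):
    """Hypothetical dispatcher: full attention for short inputs, sliding-window otherwise."""

    def __init__(self, full_attn: nn.Module, window_attn: nn.Module, window: int):
        super().__init__()
        self.full_attn = full_attn      # regular multi-head self-attention
        self.window_attn = window_attn  # Longformer-style sliding-window self-attention
        self.window = window

    def forward(self, hidden_states, **kwargs):
        seq_len = hidden_states.size(1)
        if seq_len <= self.window:
            # Short input: windowing brings no memory advantage, so skip its overhead.
            return self.full_attn(hidden_states, **kwargs)
        # Long input: sliding-window attention scales linearly with seq_len.
        return self.window_attn(hidden_states, **kwargs)
```

As noted above, the two attention paths would likely need joint pretraining so the model behaves consistently whichever branch is taken.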

Great, all makes sense. I'll run benchmarking for longer sequences and flag if anything unusual shows up. Thanks!
