Bert: Plans to support longer sequences?

Created on 2 Nov 2018 · 9 Comments · Source: google-research/bert

Right now, the model (correct me if I'm wrong) appears to be limited to sequences of at most 512 tokens, based on running & playing with the code (and this makes sense in the context of the paper).

Are there any near-term plans to support longer sequences?

Offhand, this would require addressing several issues, including: 1) allowing positional embeddings that extend to longer or perhaps arbitrary lengths (with some degradation beyond the lengths seen in training, of course), possibly using something like the sinusoidal embeddings from the original Transformer paper; and 2) containing the Transformer's quadratic memory growth (my first instinct would be to try something like the techniques in "Generating Wikipedia by Summarizing Long Sequences", https://arxiv.org/abs/1801.10198).
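For reference, here is a minimal sketch of the fixed sinusoidal position encodings from the original Transformer paper, which can be computed for any position rather than looked up in a table of 512 learned vectors. The function name and the plain-Python list representation are illustrative assumptions, not part of the BERT codebase, and nothing here implies that swapping them into a pretrained BERT would work without retraining.

```python
import math

def sinusoidal_position_encoding(num_positions, dim):
    """Return a num_positions x dim table of fixed sinusoidal encodings.

    Uses the formula from "Attention Is All You Need":
      PE[pos][2i]   = sin(pos / 10000^(2i/dim))
      PE[pos][2i+1] = cos(pos / 10000^(2i/dim))
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        # i takes the even indices 0, 2, 4, ..., so i/dim equals 2k/dim.
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))  # even index
            row.append(math.cos(angle))  # odd index
        table.append(row)
    return table

# The table can be built for any length, e.g. well past 512 positions.
encodings = sinusoidal_position_encoding(num_positions=2048, dim=8)
```

Because the encoding is a closed-form function of the position, no parameter limits the sequence length; whether a model generalizes to positions it never saw in training is a separate question.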

Right now, from a first pass, it seems like the way to use this over longer sequences is to chunk the docs into sequences (either inline with fixed lengths, or possibly as pre-processing on boundaries like sentences or paragraphs), apply BERT in a feature-input mode, and then feed the result into something else downstream (like a Universal Transformer).

All of this seems doable, but is 1) more complicated from an engineering perspective and 2) loses the ability to fine-tune (at least in any way that is obvious to me).

(Of course, a model adapted to longer sequences, like the one in https://arxiv.org/abs/1801.10198, has model-power trade-offs, so it is plausible that the feature-based approach could still be superior?)

Most helpful comment

We don't plan to make major changes to this library, so anything like that would be part of a separate project.

Our recommended recipe is exactly what you describe (it's what we do for SQuAD), but you can actually fine-tune on it normally (we just don't do it for SQuAD because only a few percent of SQuAD documents are longer than 384 tokens, so it didn't matter. But we should have).

Let's say you have:

the man went to the store and bought a gallon of milk

And had max_seq_length = 6, stride = 3, then you could split it up like this:

the man went to the store
to the store and bought a
and bought a gallon of milk

So from BertModel's perspective this is a 3x6 minibatch, but crucially you can reshape it after you get it back from BertModel.get_sequence_output() and softmax over all the tokens when you compute the loss (with some masking to make sure you don't double count the boundary words like "to the store" and "and bought a"). So you will be fine-tuning over the whole document end-to-end. The exact implementation is task-specific of course.
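The recipe above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the rule "a token belongs to the first window that contains it" is one possible masking policy (the comment does not specify one), and the logits passed to the softmax stand in for per-token scores that a real model would compute from BertModel.get_sequence_output().

```python
import math

def split_into_windows(tokens, max_seq_length, stride):
    """Split tokens into overlapping windows; returns (start_offset, window) pairs."""
    if not 0 < stride <= max_seq_length:
        raise ValueError("stride must be in 1..max_seq_length")
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + max_seq_length]))
        if start + max_seq_length >= len(tokens):
            break  # this window already reaches the end of the document
        start += stride
    return windows

def ownership_mask(windows, pad_to):
    """Mark with 1 the positions whose token was not covered by an earlier window.

    Assumed policy: each document token is counted in the first window
    that contains it; padding positions are always 0.
    """
    covered = 0  # number of leading document tokens already owned
    masks = []
    for start, window in windows:
        mask = []
        for j in range(pad_to):
            owned = j < len(window) and start + j >= covered
            mask.append(1 if owned else 0)
        covered = max(covered, start + len(window))
        masks.append(mask)
    return masks

def softmax_over_document(logits, masks):
    """Softmax jointly over all owned positions of a (windows x length) grid."""
    flat = [(l, m) for row_l, row_m in zip(logits, masks)
            for l, m in zip(row_l, row_m)]
    peak = max(l for l, m in flat if m)  # subtract the max for stability
    exps = [math.exp(l - peak) if m else 0.0 for l, m in flat]
    total = sum(exps)
    return [e / total for e in exps]

tokens = "the man went to the store and bought a gallon of milk".split()
windows = split_into_windows(tokens, max_seq_length=6, stride=3)
masks = ownership_mask(windows, pad_to=6)
for (start, window), mask in zip(windows, masks):
    print(start, " ".join(window), mask)
```

On the example sentence this reproduces the three windows listed above, and the masks zero out the first three positions of the second and third windows, so each of the 12 document tokens contributes to the joint softmax exactly once.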

All 9 comments

Hi,
This looks like a good solution, where a longer sequence is broken down into shorter sequences. I was wondering whether it is feasible to apply the same technique to sequences of ~100,000 tokens.
Also, could you elaborate on the reshaping from an implementation point of view?
Thanks.

Hi @vr25, did you find a good solution to your question?
I am working on classifying really long documents, and it does not seem to work well because of the max_sequence limit of 512. I just found this issue and wanted to give it a try.

Hi @oakkas
No, I haven't tried the above solution yet, but I will resume this soon. Do you have any updates on this that you would like to share here?

Thanks!

Hi again @vr25.
Not yet either. I was pulled into another project for now, but I'm hoping to start experimenting soon too.

Hey @vr25 @oakkas

Did either of you try it already?

Hi, I had to do classification of long texts, most of which had 500 - 1000 tokens, but some could contain up to 500k tokens.

So I built a system, heavily inspired by Jacob Devlin's comment above: I split my 1024-token text into a minibatch of 2x512 tokens, then concatenated the 2 CLS-token outputs (2 x 768 -> 1536) and put a regular classification head on top. Then I fine-tuned the whole system end-to-end.
(The particular implementation was on FlauBERT from Hugging Face transformers, trained in PyTorch.)

Due to the very specific nature of my texts (lots of numbers, tables and other structures), I didn't do any striding. So basically, there was no attention between the 2 individual parts of the text.
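To make the shapes in this setup concrete, here is a rough plain-Python sketch of the data flow: non-overlapping chunks, one CLS vector per chunk, and concatenation into a single feature vector for the classification head. All function names are hypothetical, the encoder is a dummy stand-in rather than FlauBERT, and special tokens such as CLS are left out for brevity.

```python
def split_into_chunks(token_ids, chunk_len, num_chunks, pad_id=0):
    """Cut token_ids into num_chunks non-overlapping chunks (no stride).

    Text beyond chunk_len * num_chunks is truncated; short chunks are padded.
    """
    chunks = []
    for k in range(num_chunks):
        chunk = token_ids[k * chunk_len:(k + 1) * chunk_len]
        chunk = chunk + [pad_id] * (chunk_len - len(chunk))
        chunks.append(chunk)
    return chunks

def fake_encoder_cls(chunk, hidden_size=768):
    """Dummy stand-in for the encoder: one hidden_size vector per chunk.

    A real system would return the CLS-token output of the model here.
    """
    mean = sum(chunk) / len(chunk)
    return [mean] * hidden_size

def concat_cls_vectors(cls_vectors):
    """Concatenate one CLS vector per chunk into a single feature vector."""
    features = []
    for vec in cls_vectors:
        features.extend(vec)
    return features

token_ids = list(range(1, 701))                      # a 700-token document
chunks = split_into_chunks(token_ids, chunk_len=512, num_chunks=2)
cls_vectors = [fake_encoder_cls(c) for c in chunks]  # 2 vectors of size 768
features = concat_cls_vectors(cls_vectors)           # size 2 * 768 = 1536
print(len(chunks), len(chunks[0]), len(features))
```

Each chunk is encoded independently, which is why there is no attention between the parts; the classification head sees both CLS vectors at once, and gradients flow back through every chunk when the system is fine-tuned end-to-end.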

For my problem this trick gave a meaningful perf gain, but it didn't change the world. The classifier was already doing quite well on truncated texts of 512 tokens, I just managed to push it a bit more. I also tried minibatches of 4x512 tokens. But this didn't give meaningful improvement as only ~5% of my texts were longer than 1024 tokens. These conclusions are task specific, of course.

We also tried to implement a BERT with the Longformer attention mechanism and compare against it: https://arxiv.org/abs/2004.05150
But for reasons unknown, the latter took significantly more time to train in our configuration, so we abandoned the idea. Note, however, that this long training is almost certainly not due to the Longformer itself. I'm very curious to see such a comparison if someone makes it. :)

@vmaryasin Can you please elaborate on why you believe the long training process is almost certainly not due to the Longformer itself?

Thanks

@donglinz It'll get a bit messy and historical here. :)

There exist 2 French-language BERTs, CamemBERT and FlauBERT, with slightly different implementations. At that time, we hadn't yet made a final choice of model for the project. It turned out that, due to the way attention layers are coded and accessed in the 2 models, it was much easier to implement Longformer on CamemBERT. In contrast, for historical reasons the above trick was implemented on FlauBERT.

What we observed was that the Longformer took significantly more time to train. But later on we noticed that a large part of this delay was actually due to plain CamemBERT being slower than plain FlauBERT in our particular setup. This was the strangest thing and, unfortunately, we didn't find the reason why. And as we already had a system that was working alright, we abandoned CamemBERT and did not try to reimplement Longformer on FlauBERT either.

So it's best to say that I don't have any conclusion about Longformer. But I wanted to mention it above as it seems to be a good option, and I'm curious to see a comparison of the two methods.
