Transformers: [Reformer/Longformer] Add cross-attention layers for Encoder-Decoder setting

Created on 8 May 2020  ·  16 comments  ·  Source: huggingface/transformers

❓ Questions & Help

Details

It seems that the Reformer model has been added to Transformers, but it's a standard auto-regressive language model. How can we use Reformer in an encoder-decoder setting?

wontfix


All 16 comments

It's a good question. I originally thought it would be quite easy to add, but it's not going to be that trivial.

For the encoder-decoder setting, we need an LSH cross-attention layer that receives different embeddings for queries and keys, which means the usual LSH hashing method does not work.
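To make the problem concrete, here is a small self-contained sketch of Reformer-style angular LSH bucketing (purely illustrative, not the library implementation; all names are made up). It shows why the scheme relies on queries and keys being the same (shared-QK) vectors, which is exactly what breaks in the cross-attention case:

```python
import torch

def angular_lsh_buckets(vectors: torch.Tensor, n_buckets: int) -> torch.Tensor:
    """Hash vectors into buckets via a random rotation (angular LSH as in the Reformer paper)."""
    dim = vectors.shape[-1]
    rotations = torch.randn(dim, n_buckets // 2)              # random projection matrix
    rotated = vectors @ rotations                              # (..., n_buckets // 2)
    # Bucket id = arg-max over the concatenation [rotated, -rotated].
    return torch.argmax(torch.cat([rotated, -rotated], dim=-1), dim=-1)

# Self-attention: queries and keys are the *same* (shared-QK) vectors, so one hashing
# pass puts each query into the same bucket as the keys it is most similar to.
hidden = torch.randn(16, 64)                                   # 16 tokens, model dim 64
shared_qk_buckets = angular_lsh_buckets(hidden, n_buckets=8)

# Cross-attention: decoder queries and encoder keys are different tensors. Even if both
# were hashed with the same rotations, a large dot product q . k no longer implies that
# q and k point in a similar direction, so the buckets no longer group a query with the
# keys it attends to most strongly -- a dedicated LSH cross-attention scheme is needed.
decoder_queries = torch.randn(16, 64)
encoder_keys = torch.randn(128, 64)
q_buckets = angular_lsh_buckets(decoder_queries, n_buckets=8)
k_buckets = angular_lsh_buckets(encoder_keys, n_buckets=8)
```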

It will probably take a while until this is implemented since as far as I know the authors of Reformer themselves are still working on it.

This question could also be a good opportunity to brainstorm a bit on how a LSH Cross Attention layer could look like :-)

Are you referring to https://github.com/google/trax/blob/25479c577ef28780118a4f0169e1c7f5a028a7a3/trax/models/reformer/reformer.py#L846-L847 ?

I assume the approach taken for the encoder-decoder in the machine translation notebook wouldn't work here?

Yeah, for now, we could do the same as is done there and simply add normal "non-chunked" attention for encoder-decoder settings. I'll think about it.


Hi Patrick, any update on this?

Thanks :)

It will take a while until this is implemented. Not sure if @patil-suraj already has an idea of how he would implement it :-)

@patrickvonplaten there's been an update on https://github.com/google/trax/issues/539#issuecomment-647980326

Awesome, thanks for noting that!
I'll try to take a look at this as soon as possible.
Longformer and Reformer Encoder-Decoder models are definitely a high priority :-)

@patrickvonplaten how is the Reformer Encoder Decoder coming along? I would like to compare this model to Longformer Encoder Decoder for long document summarisation so I will probably begin working on this soon if it still looks to be a while away.

@alexgaskell10, if you want to implement it yourself, you can use a regular encoder-decoder model with the LSH self-attention in the encoder only. The other two attention blocks in the decoder (cross-attention and final self-attention) can still use the regular full attention. This works when the output length is not very long, which is typically the case. You won't get the memory savings of the reversible transformer, but maybe you don't need it and gradient checkpointing is enough.
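As a rough sketch of that layout (made-up shapes, a simple chunked local attention standing in for the LSH/sparse encoder attention, and causal masking omitted for brevity), only the encoder self-attention over the long input needs to avoid the full n x n score matrix:

```python
import torch
import torch.nn.functional as F

def chunked_self_attention(x: torch.Tensor, chunk: int) -> torch.Tensor:
    """Each token attends only within its local chunk: memory scales with n * chunk, not n^2.
    Assumes the sequence length is divisible by the chunk size."""
    n, d = x.shape
    xc = x.view(n // chunk, chunk, d)                        # (n_chunks, chunk, d)
    scores = xc @ xc.transpose(-1, -2) / d ** 0.5            # (n_chunks, chunk, chunk)
    return (F.softmax(scores, dim=-1) @ xc).reshape(n, d)

def full_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

src = torch.randn(4096, 64)    # long source document
tgt = torch.randn(256, 64)     # short target sequence (e.g. a summary)

enc = chunked_self_attention(src, chunk=64)   # sparse encoder self-attention (the bottleneck)
dec = full_attention(tgt, tgt, tgt)           # full decoder self-attention: only 256 x 256
out = full_attention(dec, enc, enc)           # full cross-attention: 256 x 4096, still cheap
```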

I am curious if you already have numbers for LongformerEncoderDecoder that you can share?

@ibeltagy thanks for this. Yes, that is my plan: begin by implementing the self-attention in the encoder and then add the other memory-saving features if necessary. This will probably be fine for my use-case, as the encoder is the main memory bottleneck.

I have run lots of experiments for input lengths < 1024 tokens. I am in the process of running more with input lengths > 1024, but the early results have been strange, so I need to do some digging and these aren't ready to share. I'll post them here when they are ready. In the meantime, are you interested in experiments with < 1024 tokens?

I want to give a short update on EncoderDecoder models for Longformer / Reformer from my side.

Given that the Reformer Encoder/Decoder code is still very researchy in the original trax code-base and thus still prone to change, we will probably wait a bit until we implement Reformer Encoder/Decoder logic for transformers.

From my experience, few people would take the Reformer Encoder/Decoder framework to do massive pre-training from scratch. Most likely, we will wait until the Reformer authors publish weights or at least a paper.

At the moment, I am working on fine-tuning a Bert2Bert model via the Encoder/Decoder framework (take two pretrained bert-base-uncased models => EncoderDecoder.from_pretrained("bert-base-cased", "bert-base-cased") and fine-tune the model). This means that especially the decoder weights have to be adapted a lot, since in the EncoderDecoder framework the decoder has a causal mask and the cross-attention layers have to be trained from scratch. The results so far are quite promising, in that it might not be too difficult to fine-tune two pretrained "encoder-only" models into an encoder-decoder model.
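For reference, a minimal sketch of the warm-started Bert2Bert setup described above, using the transformers EncoderDecoderModel API (argument handling and return types have changed across library versions, so treat the details as illustrative rather than exact):

```python
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Both encoder and decoder are initialized from pretrained BERT checkpoints; the
# decoder gets a causal mask and randomly initialized cross-attention layers.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("a long input document ...", return_tensors="pt")
labels = tokenizer("a short target summary ...", return_tensors="pt").input_ids

# One fine-tuning step: the cross-attention layers (and the rest of the decoder)
# are adapted to the seq2seq task.
outputs = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    labels=labels,
)
outputs.loss.backward()
```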

Since Reformer also does not yet have a massively pretrained bi-directional encoder-only model, the focus will most likely shift to the Longformer encoder-decoder framework afterwards. Longformer has more traction, more pre-trained models and is more similar to Bert2Bert fine-tuning.

That's mostly my personal opinion - I would be very interested in your opinions about it!

Hi @patrickvonplaten, I agree with you here. I also wanted to implement Reformer Encoder/Decoder, but as no previous experiments are available from the original authors and no pre-trained weights are provided, I decided to go with LongBart, i.e. replacing BART's attention with Longformer's sliding-window attention. @alexgaskell10 has already experimented with it and it seems to be doing well enough. @ibeltagy has also added LongformerEncoderDecoder using BART weights.

As pre-trained weights are already available for Longformer, and as the sliding-window attention can be used as a drop-in replacement for self-attention, prioritizing the Longformer encoder-decoder makes more sense to me than Reformer.
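For intuition, sliding-window attention keeps the standard attention interface and simply masks out scores outside a window of w tokens on each side, which is what makes it usable as a drop-in replacement. A naive self-contained sketch (Longformer's actual implementation avoids materializing the full n x n score matrix, unlike this one):

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(x: torch.Tensor, window: int) -> torch.Tensor:
    """Full attention restricted to a band of +/- `window` positions around each token."""
    n, d = x.shape
    scores = x @ x.transpose(-1, -2) / d ** 0.5               # (n, n), for illustration only
    idx = torch.arange(n)
    inside_band = (idx[None, :] - idx[:, None]).abs() <= window
    scores = scores.masked_fill(~inside_band, float("-inf"))  # drop everything outside the window
    return F.softmax(scores, dim=-1) @ x

tokens = torch.randn(512, 64)
out = sliding_window_attention(tokens, window=128)            # same output shape as full attention
```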

Thanks for the update @patrickvonplaten, I think that sounds sensible. Having looked into them both, Longformer ED is definitely more straightforward to implement than Reformer ED, and in my experiments so far it has performed pretty well, so it makes sense to focus efforts here.

I had started working on the Reformer ED before your update and I have a working implementation (just the bare bones: replacing BART's encoder self-attention layers with the local/LSH layers from Reformer). I am running some early experiments on it at the moment, but I don't expect it to perform well as the self-attention layers are trained from scratch. I'll share an update if anything interesting arises.

I am working on fine-tuning a Bert2Bert model via the Encoder/Decoder framework

Sounds cool, look forward to trying it out!

Oh, I cannot wait for this to be ready. 😍

I have a seq2seq problem with documents which are long enough that I need this to be ready before I can hope to solve my task.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
