Allennlp: Remove / move internal transformer implementations

Created on 3 Jan 2019  路  21Comments  路  Source: allenai/allennlp

We currently have a handful of transformer implementations. At least the following:

https://github.com/allenai/allennlp/blob/master/allennlp/modules/seq2seq_encoders/multi_head_self_attention.py
https://github.com/allenai/allennlp/blob/master/allennlp/modules/seq2seq_encoders/bidirectional_language_model_transformer.py#L139
https://github.com/allenai/allennlp/blob/master/allennlp/modules/openai_transformer.py

We should remove these and rely on an external implementation. Perhaps, Pytorch's nn.Transformer. This will (very likely) be a breaking change for our existing models.

2 Weeks Breaking change

Most helpful comment

Ideally we could just use another fully featured pytorch transformer implementation (e.g. https://github.com/pytorch/fairseq or possibly https://github.com/huggingface/pytorch-pretrained-BERT) and not have to worry about maintaining the code in allennlp. fairseq would also give us optimized CNNs and fp16 in addition to self-attention.

All 21 comments

@matt-peters during issue review it came up that you might propose standardizing on the openai transformer (assuming experiments work out). Is that accurate?

Ideally we could just use another fully featured pytorch transformer implementation (e.g. https://github.com/pytorch/fairseq or possibly https://github.com/huggingface/pytorch-pretrained-BERT) and not have to worry about maintaining the code in allennlp. fairseq would also give us optimized CNNs and fp16 in addition to self-attention.

FWIW, looks like the new PyTorch 1.1 includes a torch.nn.MultiheadAttention

Moreover, this module was landed as a part of a bigger Transformer PR

Thanks @maxdel, we have put together a PR for Transformer in PyTorch. Feel free to comment.

The transformer module is supported by torch.nn.MultiheadAttention. We have a plan to improve the efficiency of this operator in the near future. @cpuhrsch.

Hey all, please let us know how we can help you guys deduplicate some of your work and better support this library overall!

We're eager to provide a good set of tools or a full implementation of what you need for your Transformers.

Also, if you have more general feature requests, we'd be eager to hear about them as well.

@cpuhrsch Need some help figuring out the decoder abstraction - https://github.com/allenai/allennlp/pull/2517 .

The new version of pytorch includes a nn.Transformer module

Hello everyone, I just want to check if we could provide any help to apply nn.Transformer module. We want to better support AllenNLP and avoid the duplicate work. We could implement more features if you feel necessary to better serve this repo.

Updated description to reflect that we don't intend to maintain our own transformer implementation anymore.

If you find any problems and difficulties when mitigating to pytorch transformer/multiheadattention, please ping me @zhangguaheng66 on pytorch Github. We are happy to provide some hands-on help.

@zhangguanheng66 fantastic and thanks for reaching back out. I'm pretty sure we'll be doing this in the next two months.

What we should do: remove all of our internal implementations of multi-headed attention / self attention, in favor of huggingface and pytorch implementations. Some of our old models depend on this code, though. As we split out those models into their own repos, our internal implementations should go with them, when it makes sense. If we have any implementations that are not used in one of the models that we are putting into their own repo, they should just be removed.

We opened an issue to keep a record of change requests for MHA and transformer. If you have some implementations what are worth landing in pytorch repo, feel free to send requests there. https://github.com/pytorch/pytorch/issues/32590

Notes to self:

  • BidirectionalLanguageModelTransformer should go with the model.
  • StackedSelfAttentionEncoder can possibly be transparently replaced with a PyTorch implementation if we're careful with the names of the parameters?

I started adding the transformer implementation from PyTorch here: #3898

This is done already.

@dirkgr Nice job for the transformer wrapper.

I saw there is a seq-to-seq encoder MultiHeadSelfAttention. Is there any plan for it?

Despite the name, that module is not strictly a transformer as designed by Devlin et al. Also, it was only used by QaNet, so I moved it there: https://github.com/allenai/allennlp-models/blob/master/allennlp_models/rc/qanet/multi_head_self_attention.py

If you need it, feel free to grab it from there and add it to your own code.

Despite the name, that module is not strictly a transformer as designed by Devlin et al. Also, it was only used by QaNet, so I moved it there: https://github.com/allenai/allennlp-models/blob/master/allennlp_models/rc/qanet/multi_head_self_attention.py

If you need it, feel free to grab it from there and add it to your own code.

Thanks @dirkgr . I'm wondering if you plan to wrap up nn.MultiheadAttention and replace multi_head_self_attention function. To me, they are very similar.

nn.MultiheadAttention is not a seq2seq module. It implements the attention part of the transformer architecture. The pytorch transformer code (and thus our wrapper of it) uses nn.MultiheadAttention internally.

If you need it in your own forward() functions, you can just call it! There is no need for AllenNLP to wrap it.

Was this page helpful?
0 / 5 - 0 ratings