Allennlp: Any plans for a generic Decoder API for seq2seq models?

Created on 25 Nov 2018  路  18Comments  路  Source: allenai/allennlp

I am primarily working on seq2seq models. I currently use a hack-ish way of importing fairseq into an allennlp model and wrapping some objects to connect the APIs. I thought I could spend some time to implement a pluggable seq2seq model which can use multiple encoder/decoders.

Our current basic seq2seq model has an LSTMCell hardcoded inside of it, which makes it unusable for cases like a Gated CNN decoder, transformer decoder etc.

  1. Would it make sense to abstract out decoder as something separate from the seq2seq model?

  2. Any opinion on what the abstract decoder API should consist of? since there are some specialized attributes with regards to the inputs required by the decoders. Some of these "fancy" decoders need extra values other than Seq2Seq's output, e.g, GatedCNN requires outputs added and scaled with source embedding and without adding it, etc.

  3. Should the current implemented Seq2Seq encoders return additional values that are needed by the decoder in a dictionary?

With regards to the attention mechanism, since it differs for Transformer, GatedCNN etc I guess each decoder deriving the abstract class can figure that out.

Most helpful comment

Hi @sai-prasanna and @matt-gardner ,

I am primarily working on seq2seq models. I currently use a hack-ish way of importing fairseq into an allennlp model and wrapping some objects to connect the APIs. I thought I could spend some time to implement a pluggable seq2seq model which can use multiple encoder/decoders.

turns out I am doing exactly the same thing!

I am also working on seq2seq models while importing and wrapping fairseq modules so that they can be used within allennlp!


I like the idea of extending allennlp's support for seq2seq models a lot! The question is either we want to implement main architectures (transformer, conv2conv, ...) using allennlp modules from scratch, or rely on third-party seq2seq lib like fairseq.


However, both approaches have pros and cons.

At this point, I am happy with what I have because I can just import fairseq's module (seq2seq encoder/decoder/the whole model) and rely on fairseq's developers to maintain the code for these modules. Moreover, I get their inference functionality (BeamSearch, etc) off-the-shelf as well (which can be effectively used inside allennlp's models and predictors). Definitely, I will stick with it at least until NAACL deadline...

Nevertheless, despite it is good for my research project goals, integrating this kind of approach in the allennlp has the following potential downsides:
1) One more dependancy for allennlp.
2) Limited flexibility. Concretely, FairseqEncoder takes as an input tokens indexed only with SingleIdTokenIndexer, which means we have to sacrifice some nice allennlp's features.
3) We will have to submit some PRs to fairseq as well to make it more friendly to be used as a library. There is no guarantee that they will be merged thou...
For example, FaireseqTransformer takes as an input "args" object (with all model's arguments), which is then passed directly to the subsequent modules that only use some of its attributes.
4) Code duplication. Many parts of code we will import are already implemented in allennlp (e.g. StackedSelfAttention, not exactly the same code, but simial)

Potential workarounds:
1) We might make this dependency optional for those who need it
2, 3) Both can be avoided by overwriting Fairseq's modules constructors. Also, we might end up having something like Pytorch Seq2Seq Wrapper, but "Fairseq Seq2Seq Wrapper".
4) Unavoidable

Designing API and reimplementing desired seq2seq models from scratch using allennlp's modules will effectively solve all the problems of the first approach. The downsides are that it is less maintainable, needs actual models to be trained to verify implementation, and gives us less functionality (since in the first case we get literally all fairseq models off-the-shelf).

Alternatively, there is also OpenNMT-py which might be easier integrated with allennlp, but is less mature the fairseq.

Finally, I am tagging @epwalsh since I suppose he might be interested in this discussion as well.

What do you think @sai-prasanna , @matt-gardner, @epwalsh ?

All 18 comments

I think it'd be great to have better seq2seq functionality implemented in AllenNLP, and it's not something any of us are working on, so contributions would be great here. I think the first step would be to write a simple design document for what abstractions you're thinking of adding and how they interact with each other and the other abstractions in the library (key word "simple" - this doesn't have to be anything fancy). That could either be posted directly here, or as a gist, or google doc, or somewhere that we could look at it and comment on it.

Hi @sai-prasanna and @matt-gardner ,

I am primarily working on seq2seq models. I currently use a hack-ish way of importing fairseq into an allennlp model and wrapping some objects to connect the APIs. I thought I could spend some time to implement a pluggable seq2seq model which can use multiple encoder/decoders.

turns out I am doing exactly the same thing!

I am also working on seq2seq models while importing and wrapping fairseq modules so that they can be used within allennlp!


I like the idea of extending allennlp's support for seq2seq models a lot! The question is either we want to implement main architectures (transformer, conv2conv, ...) using allennlp modules from scratch, or rely on third-party seq2seq lib like fairseq.


However, both approaches have pros and cons.

At this point, I am happy with what I have because I can just import fairseq's module (seq2seq encoder/decoder/the whole model) and rely on fairseq's developers to maintain the code for these modules. Moreover, I get their inference functionality (BeamSearch, etc) off-the-shelf as well (which can be effectively used inside allennlp's models and predictors). Definitely, I will stick with it at least until NAACL deadline...

Nevertheless, despite it is good for my research project goals, integrating this kind of approach in the allennlp has the following potential downsides:
1) One more dependancy for allennlp.
2) Limited flexibility. Concretely, FairseqEncoder takes as an input tokens indexed only with SingleIdTokenIndexer, which means we have to sacrifice some nice allennlp's features.
3) We will have to submit some PRs to fairseq as well to make it more friendly to be used as a library. There is no guarantee that they will be merged thou...
For example, FaireseqTransformer takes as an input "args" object (with all model's arguments), which is then passed directly to the subsequent modules that only use some of its attributes.
4) Code duplication. Many parts of code we will import are already implemented in allennlp (e.g. StackedSelfAttention, not exactly the same code, but simial)

Potential workarounds:
1) We might make this dependency optional for those who need it
2, 3) Both can be avoided by overwriting Fairseq's modules constructors. Also, we might end up having something like Pytorch Seq2Seq Wrapper, but "Fairseq Seq2Seq Wrapper".
4) Unavoidable

Designing API and reimplementing desired seq2seq models from scratch using allennlp's modules will effectively solve all the problems of the first approach. The downsides are that it is less maintainable, needs actual models to be trained to verify implementation, and gives us less functionality (since in the first case we get literally all fairseq models off-the-shelf).

Alternatively, there is also OpenNMT-py which might be easier integrated with allennlp, but is less mature the fairseq.

Finally, I am tagging @epwalsh since I suppose he might be interested in this discussion as well.

What do you think @sai-prasanna , @matt-gardner, @epwalsh ?

For what it鈥檚 worth, I think implementing decoder abstractions into AllenNLP would be a good addition. I鈥檇 be willing to help out if that鈥檚 the route we choose and someone can provide a good overview of the desired API.

@maxdel I concur with the pros and cons. I wish we can convince fairseq team to use allennlp, but that might mostly not work out. My need currently is to try out stuff like elmo in encoder which as you pointed out doesn't work well with fairseq's encoder.

Points to keep in mind while designing the API for decoding

  1. Make it play well with caching for incremental decoding, maybe we can lift a page out of fairseq's way of passing the caching dictionary in each decoding step, instead of stateful properties inside decoder like opennmt

  2. Make it flexible (like opennmt , unlike fairseq) so that one can potentially combine different encoders with different decoders.

  3. Keep beam search in mind.

btw can you share the code, to compare notes on how you wrapped things, I do an ugly hack to replace fairseq's pytorch embedding with allennlp's Embedding object, and a wrapper for dictionary.

@sai-prasanna @maxdel if you are interested, maybe we could get the ball rolling on a document for the proposed API? GoogleDoc or whatever is easiest

@epwalsh @sai-prasanna @maxdel I am also interested in contributing to this as I have been working with fairseq models and would like to make porting them to allennlp easier. I just started a blank GoogleDoc where we can discuss ideas about the proposed API. If you prefer another format let me know.

Sorry to be slow responding, and thanks for going ahead with this. When you all are comfortable with the state of the google doc, we'll give it a design review.

One thing to think about: we have a general encoder/decoder architecture already implemented, but it's designed for state machines, where the assumption is that you have different allowed actions at every timestep, and there's a lot of overhead for dealing with that. Is there any room to combine these two things? I'm guessing that's more work than it's worth, because it's just extra overhead that will slow down more typical seq2seq use cases, but it's worth thinking about a bit, at least.

Oh, another thing to keep in mind: how do you want to handle different training regimes? Like beam search optimization, other RL-inspired techniques, and so on?

@Alex-Fabbri @maxdel @epwalsh - I started charting out stuff in that doc. Feel free to comment/edit as required.

Hi @sai-prasanna

btw can you share the code, to compare notes on how you wrapped things, I do an ugly hack to replace fairseq's pytorch embedding with allennlp's Embedding object, and a wrapper for dictionary.

I also wrap the dictionary and don't even wrap an Embedding object so the code is probably about the same.

Hi, I am working on related NLP problems as a side-project and I am interested in possibility to contribute to AllenNLP. I am wondering if there are any progress on this since last comment here. It looks like a large modification required to implement this issue, so maybe a little coordination required. Is there any fork with initial attempts or something like this?

As far as I鈥檓 aware there hasn鈥檛 been a fork started yet, but I think it might be a good time to get that going. That would at least help get the ball rolling. Maybe we could start working off of a separate branch here?

Hi all, Sorry, was busy for the past couple of weeks. @epwalsh It would be great if you start the ball rolling. Saw your cool repo on copynets, I think you would have a better idea on how to start off the API spec.

Maybe we implement a generic decoder API, and an RNN implementation + beam search and attention. Then cover the remaining implementations like transformer decoder and CNN decoder.

@sai-prasanna I don't mind starting it off, unless one of the maintainers here can create a new branch for us. In any case, I probably won't be able to get to this started until after new years. So I'll start it off then unless someone else takes it up before that. Happy holidays!

If you want to do it on a branch here, that's fine with me. The pro is that you'd get more frequent feedback, because each PR needs to be reviewed and approved by us, and then there isn't some massive PR at the end that has to be reviewed all together. The con is that we might become a bottleneck.

Also, I was waiting for someone to let me know that that document was ready to be looked at for a design review. Should I look at it now? (Where "now" means "when I'm back from vacation on January 2")

@matt-gardner You could take a look, but sadly we didn't conclude with a consensus on an API design covering all that's needed in the document discussion (https://docs.google.com/document/d/1tr_q0Czal4xdMWzwd0K-Tf3mBCk-_tckKcZtdzIM4P0). I think that a good design (for the decoder) would evolve once we start implementing the RNN, CNN, Transformer backends for the abstract decoder.

@epwalsh After new year is fine. Enjoy the holidays!.

I definitely want to avoid one huge PR, so I think the most efficient way forward is for us to work on our own fork (which I, or someone else can start), and then incrementally submit PRs here as we go. Like @sai-prasanna said, I think a good first step (and first PR) would be to implement a generic base decoder class with some of the features we talked about in the doc (beam search, attention, etc), which we can then use to refactor the simple_seq2seq model. Subsequent PRs can work on improving the base decoder and making it even more general. Hopefully it can eventually be used in the semantic parsing framework as well (as @matt-gardner mentioned above), but I think that would be too big of a change / too much to think about for the first PR.

@epwalsh that sounds like a good plan. If you'd like feedback on your fork at any point before submitting the PRs here, just ping me, and I'll take a look.

Ok so I've set up a branch on my fork called "decoder" and started a PR into that branch that sets up some basic structure. I figured our workflow could go like this:

  1. we all contribute through PRs or direct commits to the "decoder" branch on my fork, and we can continue the discussion of the details of the API and implementation on those PRs,
  2. when we are satisfied with the progress, we'll create an official PR into the master branch here.

I'm not 100% sure what the easiest way is for ya'll to work off of the decoder branch on my fork. One way would be to add the url for my fork as a remote to your local fork with something like this:

# Add remote via https:
git remote add epwalsh-fork https://github.com/epwalsh/allennlp

# or via ssh:
git remote add epwalsh-fork [email protected]:epwalsh/allennlp.git

Then you can create your own branch off of the decoder branch like this:

# pull latest updates
git fetch epwalsh-fork

# create new local branch from decoder branch
git checkout -b decoder-updates epwalsh-fork/decoder

# push your branch to GitHub
git push --set-upstream origin decoder-updates 

If you know of a better way let me know!

Was this page helpful?
0 / 5 - 0 ratings