Hi fairseq team!
I'm working on extending fairseq with my custom architecture.
Basically I created a custom Dictionary subclass (supporting BPE encoding), and I want to add a custom TranslationTask subclass that knows how to load it.
I also want to be able to declare my custom architecture.
At the moment I don't see a way to do that without changing the code (preprocess.py, train.py, interactive.py etc.). I see two main problems: (1) there is no way to register custom tasks and architectures from outside the fairseq codebase, and (2) preprocess.py builds and loads the Dictionary directly instead of asking the task for it.
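To make it concrete, here is the kind of plugin module I would like to point fairseq at via something like a --user-dir option (a sketch: the names are illustrative, while @register_task and @register_model_architecture are the existing fairseq decorators):

# my_plugin/__init__.py -- sketch of a hypothetical plugin module
from fairseq.models import register_model_architecture
from fairseq.tasks import register_task
from fairseq.tasks.translation import TranslationTask

@register_task('my_bpe_translation')
class MyBPETranslationTask(TranslationTask):
    """A task that loads my custom BPE-aware Dictionary subclass."""

@register_model_architecture('transformer', 'my_transformer')
def my_transformer(args):
    # declare a custom architecture by overriding defaults of an existing model
    args.encoder_layers = getattr(args, 'encoder_layers', 8)

Then something like train.py --user-dir /path/to/my_plugin --task my_bpe_translation ... would pick everything up without touching the fairseq code.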
So I would like to ask: is there some option I missed that already makes this possible? If not, would you accept a pull request enabling this feature?
At the moment I've ended up cloning all the Python scripts in the fairseq home folder... and this is definitely a bad idea!
Thanks!
Interesting! In the past we've actually recommended forking train.py and putting the imports there, but I like this --user-dir idea; a PR for this would be great! Otherwise, we're planning to do a kind of fairseq fix/hackathon thing at the end of the month and we can build it then.
Re: preprocess script and asking the task, I'm a bit uneasy about adding more complexity to the task API. Might it be simpler to just make this configurable in preprocess.py directly? Alternatively, if your changes to Dictionary are general enough could we just pull them into the existing base class?
Cool, I will work on this --user-dir and send you a pull request ASAP! I would also like to edit your setup script so that preprocess.py and train.py become globally available on the command line, just like t2t-train and t2t-datagen. That way, once fairseq is installed, we could simply run something like:
fairseq-preprocess -a ... -b ...
fairseq-train -a ... -b ...
Without the need to keep the fairseq GitHub repo around. Does that sound reasonable to you?
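Concretely, I'm thinking of setuptools console entry points, roughly like this (a sketch: the module paths are illustrative, and each script would need to expose a callable main()):

# setup.py sketch -- assumes preprocess.py and train.py each expose a main()
from setuptools import find_packages, setup

setup(
    name='fairseq',
    packages=find_packages(),
    entry_points={
        'console_scripts': [
            'fairseq-preprocess = preprocess:main',
            'fairseq-train = train:main',
        ],
    },
)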
On the preprocess side, I understand that the change is not that immediate. Let me try to explain my point. The current implementation assumes that the input text (for both training and decoding) is already tokenized, even when BPE is used. On top of that, it has a strict definition of what a Dictionary is, how to train it, and how it stores its files.
The problem with this is that it is very rigid, and it does not allow choosing a different tokenization/numericalization scheme. For example, a BPE model has its own tokens and its own token -> id map, and it is not possible to use it with fairseq. A more flexible structure could be something closer to what t2t does:
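Roughly, each task would own its own text <-> id encoder instead of a fixed Dictionary. A rough sketch of the interface I have in mind (the names are mine, loosely modeled on t2t's TextEncoder):

class TextEncoder:
    """t2t-style interface: the task owns both tokenization and the token -> id map."""

    def encode(self, text):
        # raw string -> list of token ids (tokenization happens in here too)
        raise NotImplementedError

    def decode(self, ids):
        # list of token ids -> raw string
        raise NotImplementedError

    def vocab_size(self):
        raise NotImplementedError

A BPE model, a character-level model, or the current Dictionary could each sit behind such an interface, and preprocess.py would simply ask the task for the right encoder.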
What do you think about it?
@myleott I just submitted a proposal in Pull Request https://github.com/pytorch/fairseq/pull/448.
Basically I moved the Dictionary loading and building code from preprocess.py into TranslationTask. This way, if I want to create my custom Task with a specific Dictionary, I can still use the preprocess.py script instead of having to write my own.
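In code, the idea is roughly this (a simplified sketch, not the literal diff; see the PR for the actual change):

from fairseq.data import Dictionary
from fairseq.tasks import FairseqTask

class TranslationTask(FairseqTask):

    @classmethod
    def load_dictionary(cls, filename):
        # preprocess.py asks the task for the dictionary instead of calling
        # Dictionary.load() directly, so a subclass can return its own type
        return Dictionary.load(filename)

    @classmethod
    def build_dictionary(cls, filenames):
        # dictionary building also moves behind the task, so a subclass
        # could e.g. train a BPE vocabulary here instead
        d = Dictionary()
        for filename in filenames:
            with open(filename, encoding='utf-8') as f:
                for line in f:
                    for token in line.split():
                        d.add_symbol(token)
        d.finalize()
        return d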
Please let me know if this change is acceptable or if I need to change something!
Thanks for driving these improvements @davidecaroselli! Btw, we have a new engineer who is looking into improving support for character-level models, who is now also looking at merging Tokenizer into Dictionary like we discussed in the other thread. Closing this for now, but I'll keep you updated.
Thanks @myleott for the amazing support!