Hi fairseq team!
I'm working on extending fairseq with my custom architecture.
Basically I created a custom Dictionary subclass (supporting BPE encoding), and I want to add a custom TranslationTask subclass that knows how to load it.
I also want to be able to declare my custom architecture.
At the moment I don't see a way to do that without changing the code (preprocess.py, train.py, interactive.py etc.). I see two main problems: (1) there is no way to register custom tasks and architectures from outside the fairseq codebase, and (2) preprocess.py builds and loads the Dictionary directly instead of asking the task for it.
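To make it concrete, here is the kind of plugin module I would like to point fairseq at via something like a --user-dir option (a sketch: the names are illustrative, while @register_task and @register_model_architecture are the existing fairseq decorators):

# my_plugin/__init__.py -- sketch of a hypothetical plugin module
from fairseq.models import register_model_architecture
from fairseq.tasks import register_task
from fairseq.tasks.translation import TranslationTask

@register_task('my_bpe_translation')
class MyBPETranslationTask(TranslationTask):
    """A task that loads my custom BPE-aware Dictionary subclass."""

@register_model_architecture('transformer', 'my_transformer')
def my_transformer(args):
    # declare a custom architecture by overriding defaults of an existing model
    args.encoder_layers = getattr(args, 'encoder_layers', 8)

Then something like train.py --user-dir /path/to/my_plugin --task my_bpe_translation ... would pick everything up without touching the fairseq code.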
So I would like to ask: is there some option I missed that already makes this possible? If not, would you accept a pull request enabling this feature?
At the moment I've ended up cloning all the Python scripts in the fairseq home folder... and this is definitely a bad idea!
Thanks!
Interesting! In the past we've actually recommended forking train.py and putting the imports there, but I like this --user-dir idea; a PR for this would be great! Otherwise, we're planning to do a kind of fairseq fix/hackathon thing at the end of the month and we can build it then.
Re: preprocess script and asking the task, I'm a bit uneasy about adding more complexity to the task API. Might it be simpler to just make this configurable in preprocess.py directly? Alternatively, if your changes to Dictionary are general enough could we just pull them into the existing base class?
Cool, I will work on this --user-dir and send you a pull request ASAP! I would also like to edit your setup script so that preprocess.py and train.py become globally available on the command line, just like t2t-train and t2t-datagen. That way, once fairseq is installed, we could simply run something like:
fairseq-preprocess -a ... -b ...
fairseq-train -a ... -b ...
Without the need to keep the fairseq GitHub repo around. Does that sound reasonable to you?
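Concretely, I'm thinking of setuptools console entry points, roughly like this (a sketch: the module paths are illustrative, and each script would need to expose a callable main()):

# setup.py sketch -- assumes preprocess.py and train.py each expose a main()
from setuptools import find_packages, setup

setup(
    name='fairseq',
    packages=find_packages(),
    entry_points={
        'console_scripts': [
            'fairseq-preprocess = preprocess:main',
            'fairseq-train = train:main',
        ],
    },
)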
On the preprocess side, I understand that the change is not that immediate. Let me try to explain my point. The current implementation assumes that the input text (for both training and decoding) is already tokenized, even when BPE is used. On top of that, it has a strict definition of what a Dictionary is, how to train it, and how it stores its files.
The problem with this is that it is very rigid, and it does not allow choosing a different tokenization/numericalization scheme. For example, a BPE model has its own tokens and its own token -> id map, and it is not possible to use it with fairseq. A more flexible structure could be something closer to what t2t does:
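Roughly, each task would own its own text <-> id encoder instead of a fixed Dictionary. A rough sketch of the interface I have in mind (the names are mine, loosely modeled on t2t's TextEncoder):

class TextEncoder:
    """t2t-style interface: the task owns both tokenization and the token -> id map."""

    def encode(self, text):
        # raw string -> list of token ids (tokenization happens in here too)
        raise NotImplementedError

    def decode(self, ids):
        # list of token ids -> raw string
        raise NotImplementedError

    def vocab_size(self):
        raise NotImplementedError

A BPE model, a character-level model, or the current Dictionary could each sit behind such an interface, and preprocess.py would simply ask the task for the right encoder.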
What do you think about it?
@myleott I just submitted a proposal in Pull Request https://github.com/pytorch/fairseq/pull/448.
Basically I moved the Dictionary loading and building code from preprocess.py into TranslationTask. This way, if I want to create my custom Task with a specific Dictionary, I can still use the preprocess.py script instead of having to write my own.
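In code, the idea is roughly this (a simplified sketch, not the literal diff; see the PR for the actual change):

from fairseq.data import Dictionary
from fairseq.tasks import FairseqTask

class TranslationTask(FairseqTask):

    @classmethod
    def load_dictionary(cls, filename):
        # preprocess.py asks the task for the dictionary instead of calling
        # Dictionary.load() directly, so a subclass can return its own type
        return Dictionary.load(filename)

    @classmethod
    def build_dictionary(cls, filenames):
        # dictionary building also moves behind the task, so a subclass
        # could e.g. train a BPE vocabulary here instead
        d = Dictionary()
        for filename in filenames:
            with open(filename, encoding='utf-8') as f:
                for line in f:
                    for token in line.split():
                        d.add_symbol(token)
        d.finalize()
        return d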
Please let me know if this change is acceptable or if I need to change something!
Thanks for driving these improvements @davidecaroselli! Btw, we have a new engineer who is looking into improving support for character-level models, who is now also looking at merging Tokenizer into Dictionary like we discussed in the other thread. Closing this for now, but I'll keep you updated.
Thanks @myleott for the amazing support!