Hi,
I am trying to implement BPE-dropout, using the YouTokenToMe or subword-nmt tokenizers. This requires reloading the data in each epoch, each time with a slightly different tokenization (achieved by randomly dropping a percentage of the possible merge operations from the BPE merge table).
I understand Fairseq assumes that all preprocessing happens prior to training, but is there a reasonable way to change the data during training?
Ideally, generating the new tokens would be done on the fly from raw data stored in RAM, or at least while reading the raw data from disk. But I would prefer to avoid changing Fairseq's code, and I couldn't find a way to do this without it.
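For context, this is the kind of per-epoch re-tokenization I have in mind, sketched with YouTokenToMe's Python API (the model path, example sentence and dropout probability are just placeholders):

```python
import youtokentome as yttm

# Load a previously trained BPE model (path is a placeholder).
bpe = yttm.BPE(model="bpe.model")

# With dropout_prob > 0, some merges are randomly skipped, so the same
# sentence can be segmented differently on every call / every epoch.
print(bpe.encode(["the quick brown fox"],
                 output_type=yttm.OutputType.SUBWORD,
                 dropout_prob=0.1))
```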
The only procedure that worked for me was to re-run the tokenization (with BPE-dropout) and fairseq-preprocess before each epoch, and to force Fairseq to reload the resulting dataset.
This works, but it is far from efficient.
Another concern is that I am not sure if forcing the reloading of the dataset, with the "fairseq-train data-dir:data-dir" hack, is a robust solution. Is this behavior likely to change in future versions?
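For what it's worth, as far as I can tell the hack works because fairseq treats a colon-separated data path as per-epoch shards and picks one of them each epoch. Roughly paraphrased (this is not the exact fairseq source, and the details may differ between versions):

```python
def pick_epoch_data_path(data: str, epoch: int) -> str:
    # Paraphrase of how fairseq's translation task picks a data shard per
    # epoch when the data argument is a colon-separated list of directories.
    paths = data.split(":")              # "data-dir:data-dir" -> ["data-dir", "data-dir"]
    return paths[(epoch - 1) % len(paths)]

# The ":" in the data path is also what makes fairseq reload the training
# set at every epoch, which is exactly what the hack relies on.
print(pick_epoch_data_path("data-dir:data-dir", epoch=3))  # -> "data-dir"
```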
Thank you very much for your help!
@shaywh I wonder if there has been any follow-up on this feature. I would like to implement something similar. May I know what bottlenecks caused the inefficiency, and whether you have resolved them?
Thanks,
@memray Apparently, the inefficiency existed only with the toy datasets I created to write and debug my Fairseq task; with those, preprocessing the next epoch's dataset would have delayed the training. Later I used it successfully with full datasets, without delaying the training.
I did improve my task, though. I'll share the code next week when I get back to work. The main flow is a custom translation task that, at the start of each epoch, re-runs the preprocessing (tokenization with BPE-dropout plus binarization) and only then loads the freshly written data; a rough sketch follows below.
To be 100% safe, I think you also have to make sure the preprocessing is done when get_batch_iterator() is called; I don't remember right now exactly why and how, though.
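To make the flow concrete, here is a rough sketch of what such a task could look like. The class and method names are illustrative, not the exact code from my repo, and fairseq signatures may differ slightly between versions:

```python
from fairseq.tasks import register_task
from fairseq.tasks.translation import TranslationTask


@register_task("augmented_translation_sketch")
class AugmentedTranslationSketchTask(TranslationTask):
    """Re-tokenizes and re-binarizes the training data before every epoch."""

    def load_dataset(self, split, epoch=1, combine=False, **kwargs):
        if split == "train":
            # Regenerate this epoch's binarized data (BPE-dropout + binarization)
            # before fairseq reads it from disk.
            self._preprocess(epoch)
        super().load_dataset(split, epoch=epoch, combine=combine, **kwargs)

    def get_batch_iterator(self, *args, **kwargs):
        # Depending on the fairseq version, epoch iterators may be cached per
        # dataset; clearing the cache makes sure load_dataset() (and hence the
        # preprocessing) really runs again on every epoch.
        self.dataset_to_epoch_iter = {}
        return super().get_batch_iterator(*args, **kwargs)

    def _preprocess(self, epoch):
        # Placeholder: apply BPE-dropout to the raw text and run
        # fairseq-preprocess into the directory passed as the data argument.
        raise NotImplementedError
```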
@memray
I created a basic template for training with different augmentations. You can find it here:
https://github.com/shaywh/augmented-translation
Haven't had time to test it yet, but it implements the flow I described above. I hope you'll find it useful :)
Thank you so much @shaywh! Is a single thread sufficient for the on-the-fly processing, or is it necessary to extend it to multiple threads?
You're welcome @memray
For me, the single thread worked fine. Pre-processing a sentence is usually faster than passing it through a neural network.
Admittedly, I mostly used the _preprocess method to call YouTokenToMe (multi-threaded, very fast) and fairseq-preprocess (also multi-threaded) via subprocess. But changing from single-threaded to multi-threaded, if you need it, should be easy enough.
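For reference, a hedged sketch of what that preprocessing step could look like; the file names, dictionary paths and dropout probability are assumptions, not the exact code from the template, and here YouTokenToMe is called through its Python API rather than as a subprocess, just to keep the sketch short:

```python
import subprocess

import youtokentome as yttm


def preprocess_epoch(bpe_model, raw_prefix, out_dir, dropout_prob=0.1):
    """Apply BPE-dropout to raw text, then binarize with fairseq-preprocess."""
    bpe = yttm.BPE(model=bpe_model)
    for lang in ("src", "tgt"):
        with open(f"{raw_prefix}.{lang}") as fin, \
             open(f"{out_dir}/train.{lang}", "w") as fout:
            for line in fin:
                # New dropout sample for every sentence, so segmentation
                # changes from epoch to epoch.
                pieces = bpe.encode([line.strip()],
                                    output_type=yttm.OutputType.SUBWORD,
                                    dropout_prob=dropout_prob)[0]
                fout.write(" ".join(pieces) + "\n")

    # Re-binarize into the directory that fairseq-train reads from; reusing
    # fixed dictionaries keeps the vocabulary identical across epochs.
    subprocess.run(
        ["fairseq-preprocess",
         "--source-lang", "src", "--target-lang", "tgt",
         "--trainpref", f"{out_dir}/train",
         "--srcdict", f"{out_dir}/dict.src.txt",
         "--tgtdict", f"{out_dir}/dict.tgt.txt",
         "--destdir", out_dir,
         "--workers", "8"],
        check=True,
    )
```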