Fairseq: Best way to implement BPE dropout?

Created on 23 Mar 2020 · 5 comments · Source: pytorch/fairseq

โ“ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

(Done, didn't see anything similar)

What is your question?

Hi,
I am trying to implement BPE-dropout using the YouTokenToMe or subword-nmt tokenizers. This requires reloading the data in each epoch, each time with a slightly different tokenization (achieved by dropping some percentage of the possible merge operations from the BPE merge table).

I understand Fairseq assumes that all preprocessing happens prior to training, but is there a reasonable way to change the data during training?

Ideally, generating the new tokens would be done on the fly from raw data stored in RAM, or at least while reading the raw data from disk. However, I would prefer to avoid changing Fairseq's code, and I couldn't find a way to do it.
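For context, the core of BPE-dropout is easy to sketch: during encoding, each applicable merge is randomly skipped with probability p, yielding a different segmentation of the same word on each pass. The following is an illustrative pure-Python toy, not any tokenizer's actual implementation (real tokenizers such as YouTokenToMe expose a dropout probability directly):

```python
import random

def bpe_dropout_encode(word, merges, p=0.1, rng=None):
    """Encode one word with BPE, skipping each candidate merge with prob p.

    `merges` is an ordered list of (left, right) pairs, highest priority
    first. With p=0 this reduces to ordinary deterministic BPE; with p>0
    the same word can segment differently on each call (BPE-dropout).
    """
    rng = rng or random
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while True:
        # Collect applicable merges, dropping each with probability p.
        candidates = [(rank[(a, b)], i)
                      for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
                      if (a, b) in rank and rng.random() >= p]
        if not candidates:
            break
        # Apply the surviving merge with the highest priority (lowest rank).
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols
```

With p=0 the full merge table is applied; with p=1 every merge is dropped and the word falls back to characters.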

What have you tried?

The only procedure that worked for me was:

  • Preprocess the data using --dataset-impl raw.
  • Generate a new dictionary with the possible tokens of my BPE model.
  • Define a new task that:
    -- Copies the raw data to another directory.
    -- Applies BPE with dropout each time task.get_batch_iterator is called (reading the original data from the backup directory).
  • Start training with fairseq-train data-dir:data-dir, so the data is re-read from disk instead of from the existing dataset object.

This works, but it is far from efficient.
Another concern is that I am not sure whether forcing the dataset to reload via the "fairseq-train data-dir:data-dir" hack is a robust solution. Is this behavior likely to change in future versions?
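The cost of the procedure above can be hidden by preparing the next epoch's tokenization while the current epoch trains. A minimal stdlib-only sketch of that idea; the function names and the shape of `encode` are illustrative assumptions, not Fairseq API:

```python
import threading

def retokenize(raw_lines, encode, out_path):
    """Write one freshly sampled tokenization of the raw corpus to out_path.

    `encode` is any stochastic tokenizer (e.g. BPE with dropout) mapping a
    raw line to a list of subword tokens.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        for line in raw_lines:
            f.write(" ".join(encode(line)) + "\n")

def start_background_retokenize(raw_lines, encode, out_path):
    """Start retokenizing in a worker thread so training is not blocked.

    join() the returned thread before the next epoch starts reading out_path.
    """
    worker = threading.Thread(target=retokenize,
                              args=(raw_lines, encode, out_path))
    worker.start()
    return worker
```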

What's your environment?

  • fairseq Version: 0.9.0
  • OS: Linux
  • How you installed fairseq: pip
  • Python version: 3.6
  • GPU models and configuration: Using a single GPU, the model may vary

Thank you very much for your help!

Labels: needs triage, question


All 5 comments

@shaywh I wonder if there has been any follow-up on this feature. I would like to implement something similar. May I know what bottlenecks caused the inefficiency and whether you have resolved them?

Thanks,

@memray Apparently, the inefficiency existed only with the toy datasets I created while writing and debugging my Fairseq task; there, preprocessing the next epoch's dataset would delay training. I later used it with full datasets without any training delay.

I did improve my task, though. I'll share the code next week when I get back to work. The main flow is this:

  • Create a task that inherits the translation task.
  • Create a new argument that points to a directory with the raw data.
  • Create a second directory for the data. You need to run fairseq-train data-dir-1:data-dir-2 so it reloads the datasets from disk.
  • Override load_dataset() so that, when the train set is loaded from one directory, it starts preprocessing the data into the second directory in another thread. Use the dictionaries from the first directory in fairseq-preprocess.

I think you also have to make sure the preprocessing is done when get_batch_iterator() is called to be 100% safe. I don't remember right now why and how, though.
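A structural sketch of the flow described above. Fairseq's real TranslationTask is replaced with a stub here so the control flow is self-contained; the constructor arguments, the two-directory convention, and `preprocess_fn` are assumptions, not the author's actual code:

```python
import threading

class TranslationTask:
    """Stub standing in for fairseq's TranslationTask."""
    def load_dataset(self, split, epoch=1, **kwargs):
        pass
    def get_batch_iterator(self, *args, **kwargs):
        return iter([])

class BpeDropoutTask(TranslationTask):
    """Alternates between two data dirs: while one is trained on, the
    other is re-preprocessed with a fresh BPE-dropout sample."""

    def __init__(self, raw_dir, data_dirs, preprocess_fn):
        self.raw_dir = raw_dir
        self.data_dirs = data_dirs          # [data-dir-1, data-dir-2]
        self.preprocess_fn = preprocess_fn  # e.g. BPE-dropout + fairseq-preprocess
        self._worker = None

    def load_dataset(self, split, epoch=1, **kwargs):
        super().load_dataset(split, epoch=epoch, **kwargs)
        if split == "train":
            # Kick off preprocessing of the *other* directory for next epoch.
            next_dir = self.data_dirs[epoch % 2]
            self._worker = threading.Thread(
                target=self.preprocess_fn, args=(self.raw_dir, next_dir))
            self._worker.start()

    def get_batch_iterator(self, *args, **kwargs):
        # Ensure next epoch's data is fully written before training resumes.
        if self._worker is not None:
            self._worker.join()
        return super().get_batch_iterator(*args, **kwargs)
```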

@memray
I created a basic template for training with different augmentations. You can find it here:
https://github.com/shaywh/augmented-translation

Haven't had time to test it yet, but it implements the flow I described above. I hope you'll find it useful :)

Thank you so much @shaywh! Is a single thread sufficient for the on-the-fly processing, or is it necessary to extend it to multiple threads?

You're welcome @memray
For me, a single thread worked fine: preprocessing a sentence is usually faster than passing it through a neural network.
Admittedly, I mostly used the _preprocess method to call YouTokenToMe (multi-threaded, very fast) and fairseq-preprocess (also multi-threaded) via subprocess. But switching from single- to multi-threading, if you need it, should be easy enough.
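The pattern described here is a single driver thread shelling out to tools that are themselves multi-threaded. A minimal sketch of that wrapper; the comment's example commands are paraphrased assumptions, and a placeholder command is used so the snippet runs anywhere:

```python
import subprocess
import sys

def run_external_tokenizer(cmd):
    """Run one external preprocessing step and fail loudly on errors.

    In the setup above the real commands would be along the lines of
    YouTokenToMe's encoder with a dropout probability, followed by
    fairseq-preprocess (exact flags are assumptions, not verified here).
    check=True raises CalledProcessError on a non-zero exit, so a broken
    preprocessing run cannot silently feed stale data to the next epoch.
    """
    subprocess.run(cmd, check=True)

# Placeholder: invoke Python itself instead of a real tokenizer binary.
run_external_tokenizer([sys.executable, "-c", "print('tokenized')"])
```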
