Is it possible to train RoBERTa using TPUs?
Not that we've tried. Looks like there might be external resources available though.
Coming soon
What's the current status on this?
@lematt1991 do you know which code/repo/branch the external resources link you posted points to?
You can use the tpu branch: https://github.com/pytorch/fairseq/tree/tpu
We'll be merging the relevant flags to master very soon (basically just --tpu --distributed-world-size=XX should work). We've achieved good results for RoBERTa training, but more work is needed for translation, since TPUs don't like dynamic shapes and we need to modify our batching approach to limit the number of unique batch shapes.
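To illustrate the batching constraint: XLA recompiles the computation graph for every new tensor shape, so the usual fix is to pad batches to a small fixed set of shapes. A minimal sketch in Python, not fairseq's actual implementation, and with illustrative bucket sizes:

BUCKETS = [64, 128, 256, 512]  # illustrative fixed sequence lengths

def bucket_length(seq_len):
    # Return the smallest bucket that fits seq_len.
    for bucket in BUCKETS:
        if seq_len <= bucket:
            return bucket
    raise ValueError(f"length {seq_len} exceeds the largest bucket")

def pad_batch(batch, pad_idx=1):
    # Pad all sequences in the batch to a shared, bucketed length, so
    # the compiler sees at most len(BUCKETS) distinct batch shapes
    # instead of recompiling for every new sequence length.
    target = bucket_length(max(len(seq) for seq in batch))
    return [seq + [pad_idx] * (target - len(seq)) for seq in batch]

print(pad_batch([[5, 6, 7], [8, 9]]))  # both sequences padded to length 64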
Thank you very much. Are all BPE implementations from the main branch supported, e.g. sentencepiece, Hugging Face BPE, etc.?
What value should we use for --distributed-world-size? We plan to train on a TPU pod.
--distributed-world-size should match the number of cores in the TPU pod.
I typically do:
python -m torch_xla.distributed.xla_dist \
--tpu=myleott-tpu-64 \
--conda-env=torch-xla-nightly \
-- \
python /mnt/myleott/fairseq-py/train.py \
--distributed-world-size 64 \
--tpu \
--log-format json --log-interval 25 \
(...)
Note that on TPUs we only synchronize logging stats at the end of the logging interval (--log-interval). Smaller values will slow things down a bit, so I recommend a log interval of 25 (or greater). Also note that the words-per-second counter will be off by the same factor: if you have --log-interval=25, you should multiply the wps counter by 25 to get the correct value.
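A quick worked example of that correction, using a hypothetical reported value:

log_interval = 25        # matches --log-interval=25
reported_wps = 12_000    # hypothetical value read from the json log
actual_wps = reported_wps * log_interval
print(actual_wps)        # 300000 words per second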
Initial support merged in 775122950d145382146e9120308432a9faf9a9b8. I'll be iterating on this over the coming weeks to add a README, support for translation tasks (currently only LM and RoBERTa are supported), hopefully support for model parallelism, etc.
Thanks a lot!
Any tips yet on how to set up --update-freq and --max-sentences with TPUs (esp. when training on pods)?
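For context (an assumption based on how these flags behave on GPUs, not something confirmed in this thread): --max-sentences caps the per-device batch size and --update-freq accumulates gradients over that many batches, so the effective batch size per optimizer step is their product times the world size. A hypothetical calculation:

distributed_world_size = 64  # e.g. one process per core on a 64-core pod
max_sentences = 8            # hypothetical per-core batch size
update_freq = 4              # hypothetical gradient accumulation steps
effective_batch = distributed_world_size * max_sentences * update_freq
print(effective_batch)       # 2048 sentences per optimizer step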