Is it possible to train RoBERTa using TPUs?
Not that we've tried. Looks like there might be external resources available though.
Coming soon
What's the current status on this?
@lematt1991 do you know which code/repo/branch the external resources link you posted points to?
You can use the tpu branch: https://github.com/pytorch/fairseq/tree/tpu
We'll be merging the relevant flags to master very soon (basically just --tpu --distributed-world-size=XX should work). We've achieved good results for RoBERTa training, but more work is needed for translation, since TPUs don't like dynamic shapes and we need to modify our batching approach to limit the number of unique batch shapes.
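To illustrate the batching constraint: XLA recompiles the computation graph for every new tensor shape, so the usual fix is to pad batches to a small fixed set of shapes. A minimal sketch in Python, not fairseq's actual implementation, and with illustrative bucket sizes:

BUCKETS = [64, 128, 256, 512]  # illustrative fixed sequence lengths

def bucket_length(seq_len):
    # Return the smallest bucket that fits seq_len.
    for bucket in BUCKETS:
        if seq_len <= bucket:
            return bucket
    raise ValueError(f"length {seq_len} exceeds the largest bucket")

def pad_batch(batch, pad_idx=1):
    # Pad all sequences in the batch to a shared, bucketed length, so
    # the compiler sees at most len(BUCKETS) distinct batch shapes
    # instead of recompiling for every new sequence length.
    target = bucket_length(max(len(seq) for seq in batch))
    return [seq + [pad_idx] * (target - len(seq)) for seq in batch]

print(pad_batch([[5, 6, 7], [8, 9]]))  # both sequences padded to length 64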
Thank you very much. Are all BPE implementations from the main branch supported, e.g. sentencepiece, Hugging Face BPE, etc.?
What value should we use for --distributed-world-size? We plan to train on a TPU pod.
--distributed-world-size should match the number of cores in the TPU pod.
I typically do:
python -m torch_xla.distributed.xla_dist \
--tpu=myleott-tpu-64 \
--conda-env=torch-xla-nightly \
-- \
python /mnt/myleott/fairseq-py/train.py \
--distributed-world-size 64 \
--tpu \
--log-format json --log-interval 25 \
(...)
Note that on TPUs we only synchronize logging stats at the end of the logging interval (--log-interval). Smaller values will slow things down a bit, so I recommend a log interval of 25 (or greater). Also note that the words-per-second counter will be off by the same factor: if you have --log-interval=25, you should multiply the wps counter by 25 to get the correct value.
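A quick worked example of that correction, using a hypothetical reported value:

log_interval = 25        # matches --log-interval=25
reported_wps = 12_000    # hypothetical value read from the json log
actual_wps = reported_wps * log_interval
print(actual_wps)        # 300000 words per second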
Initial support merged in 775122950d145382146e9120308432a9faf9a9b8. I'll be iterating on this over the coming weeks to add a README, support for translation tasks (currently only LM and RoBERTa are supported), hopefully support for model parallelism, etc.
Thanks a lot!
Any tips yet on how to set up --update-freq and --max-sentences with TPUs (esp. when training on pods)?
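For context (an assumption based on how these flags behave on GPUs, not something confirmed in this thread): --max-sentences caps the per-device batch size and --update-freq accumulates gradients over that many batches, so the effective batch size per optimizer step is their product times the world size. A hypothetical calculation:

distributed_world_size = 64  # e.g. one process per core on a 64-core pod
max_sentences = 8            # hypothetical per-core batch size
update_freq = 4              # hypothetical gradient accumulation steps
effective_batch = distributed_world_size * max_sentences * update_freq
print(effective_batch)       # 2048 sentences per optimizer step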