The new TPU Trainer fails when running on a single TPU core in a Colab notebook, because the XLA ParallelLoader's PerDeviceLoader does not implement a `__len__` method.
Model I am using (Bert, XLNet ...): Bert
Language I am using the model on (English, Chinese ...): English
I used a custom script, but I would imagine it's not hard to reproduce using one of the example scripts. The error trace is below, with the relevant line in bold:
Exception in device=TPU:0: object of type 'PerDeviceLoader' has no len()
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
fn(gindex, *args)
File "<ipython-input-6-3aa4c6105066>", line 264, in main
trainer.train()
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 286, in train
**t_total = int(len(train_dataloader) // self.args.gradient_accumulation_steps * self.args.num_train_epochs)**
TypeError: object of type 'PerDeviceLoader' has no len()
An exception has occurred, use %tb to see the full traceback.
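For anyone else hitting this, here is a minimal sketch of the underlying problem (the dataset and batch size are just placeholders, not from my actual script): `len()` works on the plain DataLoader, but not on the PerDeviceLoader it gets wrapped into on the affected torch_xla builds.

```python
# Minimal sketch (placeholder data) of the failure: PerDeviceLoader has no
# __len__ on the affected torch_xla builds, while the plain DataLoader does.
import torch
from torch.utils.data import DataLoader, TensorDataset
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

dataset = TensorDataset(torch.randn(128, 16))          # placeholder data
train_dataloader = DataLoader(dataset, batch_size=8)

device = xm.xla_device()
per_device_loader = pl.ParallelLoader(train_dataloader, [device]).per_device_loader(device)

print(len(train_dataloader))   # works: 16 batches
print(len(per_device_loader))  # TypeError: object of type 'PerDeviceLoader' has no len()
```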
Based on the conversation in the issue linked below, PerDeviceLoader does not implement `__len__`, so the Trainer would need to be aware of that and get the length another way.
https://github.com/pytorch/xla/issues/1191
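One way to get the length another way, sketched below with assumed variable and function names (this is an illustration, not the actual transformers fix): compute t_total from the plain DataLoader first, and only then wrap it for the TPU device.

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl


def wrap_for_tpu(train_dataloader, args):
    # Sketch only: take the step count from the plain DataLoader, which does
    # support len(), *before* wrapping it into a PerDeviceLoader.
    t_total = int(len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs)
    device = xm.xla_device()
    per_device_loader = pl.ParallelLoader(train_dataloader, [device]).per_device_loader(device)
    return per_device_loader, t_total
```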
transformers version: 2.9.0

@jysohn23 (and @dlibenzi) might want to chime in on this but I think this works in xla's nightly from https://github.com/pytorch/xla/pull/1991
(We might still want to improve things but if this is blocking you, please try using the nightly build of xla)
Yeah it's recently been added, so if you could select a newer version in the first cell of your colab notebook (env-setup.py script), it should do.
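For reference, the setup cell in the TPU colab notebooks looks roughly like the snippet below (the exact version strings and script location may differ between notebook revisions); selecting "nightly" there pulls in the build with the PerDeviceLoader length support.

```python
# Roughly what the first colab cell looks like; version strings may vary.
VERSION = "nightly"  # pick "nightly" instead of a pinned release such as "1.5"
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION
```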
Thanks for the quick response. Confirmed that using the "nightly" build of XLA works, so closing this issue.