The new TPU Trainer fails when running on a single TPU core in a Colab notebook, because the XLA ParallelLoader's PerDeviceLoader does not implement a `__len__` method.
Model I am using (Bert, XLNet ...): Bert
Language I am using the model on (English, Chinese ...): English
I used a custom script, but I would imagine it's not hard to reproduce using one of the example scripts. The error trace is below, with the relevant line in bold:
Exception in device=TPU:0: object of type 'PerDeviceLoader' has no len()
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
fn(gindex, *args)
File "<ipython-input-6-3aa4c6105066>", line 264, in main
trainer.train()
File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 286, in train
**t_total = int(len(train_dataloader) // self.args.gradient_accumulation_steps * self.args.num_train_epochs)**
TypeError: object of type 'PerDeviceLoader' has no len()
An exception has occurred, use %tb to see the full traceback.
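For anyone else hitting this, here is a minimal sketch of the underlying problem (the dataset and batch size are just placeholders, not from my actual script): `len()` works on the plain DataLoader, but not on the PerDeviceLoader it gets wrapped into on the affected torch_xla builds.

```python
# Minimal sketch (placeholder data) of the failure: PerDeviceLoader has no
# __len__ on the affected torch_xla builds, while the plain DataLoader does.
import torch
from torch.utils.data import DataLoader, TensorDataset
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

dataset = TensorDataset(torch.randn(128, 16))          # placeholder data
train_dataloader = DataLoader(dataset, batch_size=8)

device = xm.xla_device()
per_device_loader = pl.ParallelLoader(train_dataloader, [device]).per_device_loader(device)

print(len(train_dataloader))   # works: 16 batches
print(len(per_device_loader))  # TypeError: object of type 'PerDeviceLoader' has no len()
```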
Based on the conversation in the issue linked below, PerDeviceLoader does not implement `__len__`, so the Trainer would need to be aware of that and get the length another way.
https://github.com/pytorch/xla/issues/1191
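One way to get the length another way, sketched below with assumed variable and function names (this is an illustration, not the actual transformers fix): compute t_total from the plain DataLoader first, and only then wrap it for the TPU device.

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl


def wrap_for_tpu(train_dataloader, args):
    # Sketch only: take the step count from the plain DataLoader, which does
    # support len(), *before* wrapping it into a PerDeviceLoader.
    t_total = int(len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs)
    device = xm.xla_device()
    per_device_loader = pl.ParallelLoader(train_dataloader, [device]).per_device_loader(device)
    return per_device_loader, t_total
```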
transformers version: 2.9.0

@jysohn23 (and @dlibenzi) might want to chime in on this but I think this works in xla's nightly from https://github.com/pytorch/xla/pull/1991
(We might still want to improve things but if this is blocking you, please try using the nightly build of xla)
Yeah it's recently been added, so if you could select a newer version in the first cell of your colab notebook (env-setup.py script), it should do.
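For reference, the setup cell in the TPU colab notebooks looks roughly like the snippet below (the exact version strings and script location may differ between notebook revisions); selecting "nightly" there pulls in the build with the PerDeviceLoader length support.

```python
# Roughly what the first colab cell looks like; version strings may vary.
VERSION = "nightly"  # pick "nightly" instead of a pinned release such as "1.5"
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION
```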
Thanks for the quick response. Confirmed that using the "nightly" build of XLA works, so closing this issue.