Transformers: TPU Trainer's PerDeviceLoader has no len()

Created on 8 May 2020 · 3 comments · Source: huggingface/transformers

🐛 Bug

The new TPU Trainer fails when running on a single TPU core in a Colab notebook, because the XLA ParallelLoader's PerDeviceLoader does not implement __len__.

Information

Model I am using (Bert, XLNet ...): Bert

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • [ ] the official example scripts: (give details below)
  • [x] my own modified scripts: (give details below)

The task I am working on is:

  • [ ] an official GLUE/SQuAD task: (give the name)
  • [x] my own task or dataset: (give details below)

To reproduce

I used a custom script, but I would imagine it's not hard to reproduce using one of the example scripts. The error trace is below; the relevant line is the t_total computation in trainer.py:

    Exception in device=TPU:0: object of type 'PerDeviceLoader' has no len()
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
        fn(gindex, *args)
      File "<ipython-input-6-3aa4c6105066>", line 264, in main
        trainer.train()
      File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 286, in train
        t_total = int(len(train_dataloader) // self.args.gradient_accumulation_steps * self.args.num_train_epochs)
    TypeError: object of type 'PerDeviceLoader' has no len()

    An exception has occurred, use %tb to see the full traceback.
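
For a self-contained reproduction outside the Trainer, something like the following should trigger the same TypeError on a Colab TPU runtime with torch_xla 1.5. The dummy dataset and batch size are purely illustrative assumptions; the wrapping mirrors what the TPU Trainer does.

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.parallel_loader as pl

    device = xm.xla_device()

    # Any small dataset will do; this one is just a placeholder.
    dataset = TensorDataset(torch.randn(64, 4), torch.zeros(64, dtype=torch.long))
    loader = DataLoader(dataset, batch_size=8)

    # Wrap the DataLoader for XLA, then ask the per-device loader for its length.
    per_device_loader = pl.ParallelLoader(loader, [device]).per_device_loader(device)
    len(per_device_loader)  # TypeError: object of type 'PerDeviceLoader' has no len()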

Expected behavior

Based on the conversation in the issue linked below, PerDeviceLoader does not implement __len__, so the Trainer would need to be aware of that and obtain the number of training steps another way (a rough sketch of one possible workaround follows the link).

https://github.com/pytorch/xla/issues/1191
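
The sketch below shows the kind of fallback the Trainer could use. The function and its arguments (train_dataset, args.train_batch_size) are hypothetical and only mirror the names in the traceback above; this is not the fix that actually landed.

    import math

    def compute_t_total(train_dataloader, train_dataset, args):
        """Hypothetical helper: total optimization steps without relying on
        len() of an XLA PerDeviceLoader, which (before pytorch/xla#1991)
        has no __len__."""
        try:
            num_batches = len(train_dataloader)  # works for a plain DataLoader
        except TypeError:
            # Fall back to the dataset the loader was built from; the wrapped
            # PerDeviceLoader does not expose the underlying loader's length.
            num_batches = math.ceil(len(train_dataset) / args.train_batch_size)
        return int(num_batches // args.gradient_accumulation_steps * args.num_train_epochs)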

Environment info

  • transformers version: 2.9.0
  • Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.5.0a0+ab660ae (False)
  • Tensorflow version (GPU?): 2.2.0-rc4 (False)
  • Using GPU in script?: No (using a TPU)
  • Using distributed or parallel set-up in script?: XLA is used by the Trainer, but it's a single-core TPU in a Colab notebook

All 3 comments

@jysohn23 (and @dlibenzi) might want to chime in on this, but I think this works in XLA's nightly from https://github.com/pytorch/xla/pull/1991

(We might still want to improve things but if this is blocking you, please try using the nightly build of xla)

Yeah, it's recently been added, so if you could select a newer version in the first cell of your Colab notebook (the env-setup.py script), it should do.
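
For reference, the Colab setup cell being referred to typically looked something like the cell below at the time. The exact script URL and version string are assumptions based on the pytorch/xla Colab examples of that era, so check the current notebooks before copying it.

    # First cell of the Colab notebook (assumed layout): fetch the pytorch/xla
    # env-setup.py helper and install a newer / nightly torch_xla build that
    # includes the __len__ support from pytorch/xla#1991.
    VERSION = "nightly"  # or a dated build -- assumption, pick per the XLA docs
    !curl -s https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
    !python pytorch-xla-env-setup.py --version $VERSION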

Thanks for the quick response. Confirmed that using the "nightly" build of XLA works, so closing this issue.
