Transformers: 🐛 TFTrainer not working on TPU (TF2.2)

Created on 15 Jun 2020 · 3 comments · Source: huggingface/transformers

🐛 Bug

Information

The problem arises when using:

  • [ ] the official example scripts
  • [x] my own modified scripts

The task I am working on is:

  • [x] an official GLUE/SQuAD task: CNN/DM
  • [ ] my own task or dataset

To reproduce

Steps to reproduce the behavior:

  1. Install transformers from master
  2. Run TPU training using TFTrainer

I get the following error:

    TypeError: Failed to convert object of type to Tensor. Contents: . Consider casting elements to a supported type.


Here:
https://github.com/huggingface/transformers/blob/9931f817b75ecb2c8bb08b6e9d4cbec4b0933935/src/transformers/trainer_tf.py#L324

we pass the optimizer as an argument.

But according to the TensorFlow documentation:
https://github.com/tensorflow/tensorflow/blob/2b96f3662bd776e277f86997659e61046b56c315/tensorflow/python/distribute/distribute_lib.py#L890-L891

    "All arguments in args or kwargs should either be nest of tensors or tf.distribute.DistributedValues containing tensors or composite tensors."
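
To make the constraint concrete, here is a minimal, self-contained sketch (illustrative names and dummy data, not the trainer code). Under the default strategy both calls run; under a TPUStrategy the second one fails because the optimizer object cannot be converted to a tensor:

    import tensorflow as tf

    strategy = tf.distribute.get_strategy()  # a TPUStrategy in the failing setup
    optimizer = tf.keras.optimizers.Adam()   # a plain Python object, not a tensor
    features = tf.constant([1.0, 2.0])

    def train_step(x, opt):
        return x * 2.0  # stand-in for the real forward/backward pass

    # Only tensors in `args`: fine under every strategy.
    strategy.experimental_run_v2(lambda x: train_step(x, optimizer), args=(features,))

    # A non-tensor in `args`: a TPU strategy tries to convert every argument
    # to a tensor and raises the TypeError quoted above.
    strategy.experimental_run_v2(train_step, args=(features, optimizer))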

Environment info

  • transformers version: 2.11.0
  • Platform: Linux-4.9.0-9-amd64-x86_64-with-debian-9.12
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.5.0 (False)
  • Tensorflow version (GPU?): 2.2.0 (False)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: TPU training
Labels: TensorFlow, wontfix

All 3 comments

Currently, as a workaround, I set the optimizer as an attribute and remove the argument:

After this line:
https://github.com/huggingface/transformers/blob/9931f817b75ecb2c8bb08b6e9d4cbec4b0933935/src/transformers/trainer_tf.py#L235

I add:

    self.optimizer = optimizer

And rewrite the method so it no longer takes the optimizer as an argument:

https://github.com/huggingface/transformers/blob/9931f817b75ecb2c8bb08b6e9d4cbec4b0933935/src/transformers/trainer_tf.py#L326-L335

    def _step(self):
        """Applies gradients and resets accumulation."""
        gradient_scale = self.gradient_accumulator.step * self.args.strategy.num_replicas_in_sync
        gradients = [
            gradient / tf.cast(gradient_scale, gradient.dtype) for gradient in self.gradient_accumulator.gradients
        ]
        gradients = [(tf.clip_by_value(grad, -self.args.max_grad_norm, self.args.max_grad_norm)) for grad in gradients]

        self.optimizer.apply_gradients(list(zip(gradients, self.model.trainable_variables)))
        self.gradient_accumulator.reset()

And finally replace the call at:

https://github.com/huggingface/transformers/blob/9931f817b75ecb2c8bb08b6e9d4cbec4b0933935/src/transformers/trainer_tf.py#L324

with:

    self.args.strategy.experimental_run_v2(self._step)
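
In isolation, the pattern boils down to something like this toy sketch (made-up class and names, not the actual trainer): keep the optimizer as instance state so that nothing non-tensor goes through experimental_run_v2:

    import tensorflow as tf

    class TinyTrainer:
        """Toy illustration: the optimizer is reached through `self`,
        so `experimental_run_v2` is called without non-tensor args."""

        def __init__(self):
            self.strategy = tf.distribute.get_strategy()  # TPUStrategy on TPU
            self.optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
            self.var = tf.Variable(1.0)

        def _step(self):
            with tf.GradientTape() as tape:
                loss = tf.square(self.var)
            grads = tape.gradient(loss, [self.var])
            # Optimizer accessed as an attribute, not passed as an argument.
            self.optimizer.apply_gradients(zip(grads, [self.var]))

        def train_one_step(self):
            # No `args` at all, so nothing has to be converted to a tensor.
            self.strategy.experimental_run_v2(self._step)

    trainer = TinyTrainer()
    trainer.train_one_step()
    print(trainer.var.numpy())  # var moved by one SGD step: 1.0 -> 0.8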

Not closing this, as it's only a workaround. Any cleaner solution that could go in a PR?

Hello!

Nice finding! TPU support in the TF Trainer is currently under development and does not work in several cases. If you really need to train your model on TPUs, I suggest you use the PyTorch version of the trainer. Full TPU support for the TF Trainer should, I hope, arrive this month.

But if you are ready to make PRs, you are welcome to do so :)

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
