Transformers: examples/seq2seq/finetune.py and BART support TPU

Created on 20 Jul 2020 · 11 comments · Source: huggingface/transformers

  • [ ] test the code on tpu
  • [ ] if it doesn't work well: change code as little as possible to get it working.
  • [ ] add a script/command that works
Labels: Help wanted, wontfix

Most helpful comment

Makes sense. You could try to instantiate

self.lm_head =  _make_linear_from_emb(self.model.shared) 

in BartForConditionalGeneration.__init__
and then have get_output_embeddings return self.lm_head.

All 11 comments

Do you want to wait for a stable release of torch-xla? Also, shouldn't the end user pass num_tpu_cores=8 when creating the Lightning Trainer? It should automatically handle the rest for us, I think. We also have bfloat16 for TPUs.
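For reference, a minimal sketch of what that would look like (assumptions: a torch-xla Colab-style setup, and model stands in for the LightningModule that finetune.py builds; the argument name depends on the Lightning version):

import os
import pytorch_lightning as pl

# bfloat16 on TPU is typically enabled through torch-xla's environment flag,
# set before any XLA tensors are created.
os.environ["XLA_USE_BF16"] = "1"

# Older Lightning releases use `num_tpu_cores=8`, newer ones `tpu_cores=8`.
trainer = pl.Trainer(tpu_cores=8, max_epochs=1)
trainer.fit(model)  # `model` is the LightningModule built by finetune.py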

It totally could work out of the box. In which case this issue could be as simple as running a shell command on a TPU machine, seeing that it works well, and then checking in the working command or commenting here :).

Alright, cool! Let me see if I can make an attempt at it! I'm very new to this examples section of Hugging Face.

Hey! So I tried running it on Colab first, but it seems there's an error with the BART model. Here's the traceback it threw (there's another issue open for it as well: https://github.com/huggingface/transformers/issues/5915).

To reproduce it, I executed finetune.sh with:

!sh finetune.sh \
--data_dir /content/xsum \
--model_name_or_path facebook/bart-base \
--output_dir=xsum_results \
--train_batch_size=2 \
--eval_batch_size=2 \
--num_train_epochs 1 \
--n_tpu_cores 8 \
--tpu_cores 8

I also modified the args in the shell snippet where it invokes python finetune.py (removed --fp16 and set gpus to 0).

Attempted to call `variable.set_data(tensor)`, but `variable` and `tensor` have incompatible tensor type.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 222, in tpu_train
    self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 1196, in run_pretrain_routine
    False)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 293, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 470, in evaluation_forward
    output = model.validation_step(*args)
  File "finetune.py", line 145, in validation_step
    return self._generative_step(batch)
  File "finetune.py", line 176, in _generative_step
    decoder_start_token_id=self.decoder_start_token_id,
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/generation_utils.py", line 248, in generate
    if self.get_output_embeddings() is None:
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bart.py", line 1113, in get_output_embeddings
    return _make_linear_from_emb(self.model.shared)  # make it on the fly
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bart.py", line 190, in _make_linear_from_emb
    lin_layer.weight.data = emb.weight.data

Makes sense. You could try to instantiate

self.lm_head =  _make_linear_from_emb(self.model.shared) 

in BartForConditionalGeneration.__init__
and then have get_output_embeddings return self.lm_head.
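Roughly, the suggested change in modeling_bart.py would look like this (a sketch only; the rest of __init__ is unchanged and omitted here):

class BartForConditionalGeneration(PretrainedBartModel):
    def __init__(self, config: BartConfig):
        super().__init__(config)
        self.model = BartModel(config)
        # Build the LM head once at init time, tied to the shared embeddings,
        # instead of recreating it on the fly inside get_output_embeddings().
        self.lm_head = _make_linear_from_emb(self.model.shared)

    def get_output_embeddings(self):
        return self.lm_head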

Alright! Thanks; will keep you posted 🎉

Edit 1:

Things seem to work after I made the changes you suggested. The trouble is that training seems to be frozen on TPUs.

emb.weight.data is...

tensor([[-0.0370,  0.1117,  0.1829,  ...,  0.2054,  0.0578, -0.0750],
        [ 0.0055, -0.0049, -0.0069,  ..., -0.0030,  0.0038,  0.0087],
        [-0.0448,  0.4604, -0.0604,  ...,  0.1073,  0.0310,  0.0477],
        ...,
        [-0.0138,  0.0278, -0.0467,  ...,  0.0455, -0.0265,  0.0125],
        [-0.0043,  0.0153, -0.0567,  ...,  0.0496,  0.0108, -0.0099],
        [ 0.0053,  0.0324, -0.0179,  ..., -0.0085,  0.0223, -0.0020]],
       device='xla:1')

...and lin_layer.weight.data is...

tensor([[-1.0449e-03,  4.0973e-03, -9.7727e-04,  ...,  8.2363e-04,
         -3.2153e-03,  3.5317e-03],
        [ 2.3644e-03,  3.5527e-03, -1.2428e-03,  ..., -1.0983e-04,
         -2.1916e-03,  5.3099e-05],
        [-4.2492e-03,  3.8183e-04,  3.2527e-03,  ..., -4.4359e-03,
          7.6555e-04, -4.1728e-03],
        ...,
        [-4.3412e-03,  2.8537e-03,  7.9720e-04,  ...,  2.9499e-03,
          2.6357e-03, -3.5283e-03],
        [ 3.7042e-03, -3.0546e-03,  3.9206e-03,  ..., -2.3771e-03,
          4.3551e-03,  1.1703e-04],
        [ 3.5616e-03, -3.1224e-03,  1.3898e-03,  ..., -2.1096e-05,
          5.4077e-04,  1.6183e-03]])

... and you try to do...

lin_layer.weight.data = emb.weight.data

Isn't the problem that they are on different devices?
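A minimal sketch of that mismatch in isolation (assuming a torch_xla environment; not the actual finetune.py code):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
emb = torch.nn.Embedding(10, 4).to(device)  # weights live on the XLA device
lin = torch.nn.Linear(4, 10, bias=False)    # weights live on the CPU

# Fails with "variable and tensor have incompatible tensor type":
# .data assignment cannot mix a CPU parameter with an XLA tensor.
lin.weight.data = emb.weight.data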

We can remove the code that creates the linear layer on the fly (refer to sshleifer's comment above); that won't work in the case of TPUs.

Good to know. Once you get everything working it would be great to have all the required changes consolidated into one PR.

I am not quite sure that Lightning does this optimization when we use multiple TPU cores (only available on the nightly torch-xla builds). Refer here.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
