Do you want to wait for a stable torch-xla release? Also, shouldn't the end user just pass `num_tpu_cores=8` when creating Lightning's Trainer? It should handle the rest automatically, I think. We also have bfloat16 for TPUs.
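For reference, something like this is what I have in mind (just a sketch; the argument is named num_tpu_cores in older pytorch-lightning releases and tpu_cores in newer ones, and model stands for any LightningModule):

import pytorch_lightning as pl

# Sketch only: the flag name depends on the installed pytorch-lightning version.
# bfloat16 on TPU is typically enabled via the XLA_USE_BF16=1 environment variable.
trainer = pl.Trainer(num_tpu_cores=8)   # newer versions: pl.Trainer(tpu_cores=8)
trainer.fit(model)                      # model: any LightningModule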
It totally could work out of the box, in which case this issue could be as simple as running a shell command on a TPU machine, seeing that it works well, and then checking in the working command or commenting here :).
Alright, cool! Let me see if I can make an attempt at it! I'm very new to this examples section of Hugging Face.
Hey! So I tried running it on Colab first, but it seems there's some error with the BART model (there is another issue open for it as well: https://github.com/huggingface/transformers/issues/5915). The traceback it threw is below.
To reproduce it, I executed finetune.sh with:
!sh finetune.sh \
--data_dir /content/xsum \
--model_name_or_path facebook/bart-base \
--output_dir=xsum_results \
--train_batch_size=2 \
--eval_batch_size=2 \
--num_train_epochs 1 \
--n_tpu_cores 8 \
--tpu_cores 8
I also modified the args in the shell script where it invokes python finetune.py (removed --fp16 and set gpus to 0).
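Roughly, the invocation inside finetune.sh then looks like this (illustrative sketch only; other flags are omitted and the exact flag names depend on the script version):

# inside finetune.sh (sketch): --fp16 dropped, gpus forced to 0,
# the TPU flags passed above are forwarded through "$@"
python finetune.py \
    --gpus 0 \
    --do_train \
    --do_predict \
    "$@"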
Attempted to call `variable.set_data(tensor)`, but `variable` and `tensor` have incompatible tensor type.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
fn(gindex, *args)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 222, in tpu_train
self.run_pretrain_routine(model)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 1196, in run_pretrain_routine
False)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 293, in _evaluate
output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 470, in evaluation_forward
output = model.validation_step(*args)
File "finetune.py", line 145, in validation_step
return self._generative_step(batch)
File "finetune.py", line 176, in _generative_step
decoder_start_token_id=self.decoder_start_token_id,
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/transformers/generation_utils.py", line 248, in generate
if self.get_output_embeddings() is None:
File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bart.py", line 1113, in get_output_embeddings
return _make_linear_from_emb(self.model.shared) # make it on the fly
File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bart.py", line 190, in _make_linear_from_emb
lin_layer.weight.data = emb.weight.data
Makes sense. You could try to instantiate
self.lm_head = _make_linear_from_emb(self.model.shared)
in BartForConditionalGeneration.__init__
and then have get_output_embeddings return self.lm_head.
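For concreteness, a sketch of that change in modeling_bart.py (simplified; the real __init__ does more than this):

class BartForConditionalGeneration(PretrainedBartModel):
    def __init__(self, config: BartConfig):
        super().__init__(config)
        self.model = BartModel(config)
        # Build the lm_head once at init time so that .to(device) / the XLA
        # wrapper moves it together with the rest of the module, instead of
        # rebuilding it on the fly inside generate().
        self.lm_head = _make_linear_from_emb(self.model.shared)

    def get_output_embeddings(self):
        return self.lm_head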
Alright! Thanks, will keep you posted 🎉
Edit 1:
So things seem to work after I made the changes you suggested. The trouble is that training seems to be frozen on TPUs.
emb.weight.data is...
tensor([[-0.0370, 0.1117, 0.1829, ..., 0.2054, 0.0578, -0.0750],
[ 0.0055, -0.0049, -0.0069, ..., -0.0030, 0.0038, 0.0087],
[-0.0448, 0.4604, -0.0604, ..., 0.1073, 0.0310, 0.0477],
...,
[-0.0138, 0.0278, -0.0467, ..., 0.0455, -0.0265, 0.0125],
[-0.0043, 0.0153, -0.0567, ..., 0.0496, 0.0108, -0.0099],
[ 0.0053, 0.0324, -0.0179, ..., -0.0085, 0.0223, -0.0020]],
device='xla:1')
...and lin_layer.weight.data is...
tensor([[-1.0449e-03, 4.0973e-03, -9.7727e-04, ..., 8.2363e-04,
-3.2153e-03, 3.5317e-03],
[ 2.3644e-03, 3.5527e-03, -1.2428e-03, ..., -1.0983e-04,
-2.1916e-03, 5.3099e-05],
[-4.2492e-03, 3.8183e-04, 3.2527e-03, ..., -4.4359e-03,
7.6555e-04, -4.1728e-03],
...,
[-4.3412e-03, 2.8537e-03, 7.9720e-04, ..., 2.9499e-03,
2.6357e-03, -3.5283e-03],
[ 3.7042e-03, -3.0546e-03, 3.9206e-03, ..., -2.3771e-03,
4.3551e-03, 1.1703e-04],
[ 3.5616e-03, -3.1224e-03, 1.3898e-03, ..., -2.1096e-05,
5.4077e-04, 1.6183e-03]])
... and you try to do...
lin_layer.weight.data = emb.weight.data
Isn't the problem that they are on different devices?
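One way around the cross-device .data assignment would be to tie the Parameter itself instead of copying its data (hypothetical helper, not the library's _make_linear_from_emb; make_lm_head_from_emb is an assumed name):

import torch.nn as nn

def make_lm_head_from_emb(emb: nn.Embedding) -> nn.Linear:
    vocab_size, emb_size = emb.weight.shape
    lin_layer = nn.Linear(emb_size, vocab_size, bias=False)
    # Share the Parameter directly: no CPU tensor is created and then
    # assigned onto an XLA tensor, so the device mismatch never occurs.
    lin_layer.weight = emb.weight
    return lin_layer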
We can remove the part that creates it on the fly (refer to sshleifer's comment above); that won't work in the case of TPUs.
Good to know. Once you get everything working it would be great to have all the required changes consolidated into one PR.
I am not quite sure that Lightning does this optimization when we use multiple TPU cores (only available with the nightly torch-xla builds). Refer here.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.