Transformers: RuntimeError: Gather got an input of invalid size: got [2, 3, 12, 256, 64], but expected [2, 4, 12, 256, 64] (gather at /opt/conda/conda-bld/pytorch_1544199946412/work/torch/csrc/cuda/comm.cpp:227)

Created on 7 Sep 2019 · 14 comments · Source: huggingface/transformers

โ“ Questions & Help

Hi,
I am running a modified version of run_lm_finetuning.py. It was working fine and model checkpoints were being saved, until the last step of the first epoch (9677/9678), where I got this error:

9677/9678 [2:01:24<00:00,  1.36it/s]
Traceback (most recent call last):
  File "my_run_lm_finetuning.py", line 588, in <module>
    main()
  File "my_run_lm_finetuning.py", line 542, in main
    global_step, tr_loss = train(args, train_dataset, model, bert_model_fintuned, tokenizer, bert_tokenizer)
  File "my_run_lm_finetuning.py", line 260, in train
    outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, enc_output, labels=labels)
  File "/home/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 144, in forward
    return self.gather(outputs, self.output_device)
  File "/home/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 156, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
    return gather_map(outputs)
  File "/home/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/anaconda3/envs/py36/lib/python3.6/site-packages/torch/cuda/comm.py", line 166, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [2, 3, 12, 256, 64], but expected [2, 4, 12, 256, 64] (gather at /opt/conda/conda-bld/pytorch_1544199946412/work/torch/csrc/cuda/comm.cpp:227)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f3c52b7fcc5 in /home/anaconda3/envs/py36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: torch::cuda::gather(c10::ArrayRef<at::Tensor>, long, c10::optional<int>) + 0x4d8 (0x7f3c936eaba8 in /home/anaconda3/envs/py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0x4f99de (0x7f3c936ed9de in /home/anaconda3/envs/py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x111e36 (0x7f3c93305e36 in /home/anaconda3/envs/py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #14: THPFunction_apply(_object*, _object*) + 0x5dd (0x7f3c9350140d in /home/anaconda3/envs/py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

Note that in this experiment I used a fine-tuned version of BERT (I fine-tuned it using your previous script in the lm_finetune folder) with max_seq_length=256; however, when running this script (run_lm_finetuning.py), I have block_size=128.

Any idea what is causing this error?

wontfix


All 14 comments

This is a wild guess since I don't have access to your modified version, but I feel like this has to do with a mismatch in the batch size (expecting a batch size of 4 but receiving a batch size of 3).

Could you check your input tensor and label tensor sizes and get back to me so I can try and reproduce it on my end?
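
A quick way to do that check is to log the shapes right before the forward call in the training loop. This is only a sketch; the names epoch_iterator, mask_tokens, and args.train_batch_size follow the stock run_lm_finetuning.py and may differ in the modified script:

for step, batch in enumerate(epoch_iterator):
    inputs, labels = mask_tokens(batch, tokenizer, args) if args.mlm else (batch, batch)
    # Log the per-step tensor shapes; the last step of the epoch is the one to watch,
    # since an incomplete final batch is smaller than args.train_batch_size.
    print("step", step, "inputs", tuple(inputs.shape), "labels", tuple(labels.shape))
    # ... rest of the training step unchanged ...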

@LysandreJik I saved the inputs and reloaded them. They are of size [7, 256].

The thing is, I don't know why the error reports a 5-dimensional size rather than 3; even after the attention split, the size should be 4-dimensional: [batch_size, sequence_length, head, head_feature].

Also, how can I tell where exactly the error comes from, i.e. which line of code in the modeling scripts causes it?

I tried saving the specific batch of inputs right before the program throws this error and terminates. Outside the program, I loaded those inputs and passed them to the line of code that causes the error, and that did not raise any error. However, when training the model inside the script, it throws the error.

I guess it might have something to do with parallel/distributed training.
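
That guess is consistent with the shapes in the error. With two GPUs and a final batch of 7, DataParallel splits it 4 + 3, and its gather along dim 0 requires every other dimension to match; that fails for an output whose batch dimension is not dim 0, which the 5-dimensional [2, batch, 12, 256, 64] tensor here looks like. A minimal sketch (not the author's code) that reproduces the same failure, assuming two CUDA devices are available:

import torch
from torch.cuda import comm

# Pretend these are the per-GPU model outputs: dim 0 is not the batch dimension,
# dim 1 is the batch dimension (4 examples on GPU 0, 3 on GPU 1).
out_gpu0 = torch.randn(2, 4, 12, 256, 64, device="cuda:0")
out_gpu1 = torch.randn(2, 3, 12, 256, 64, device="cuda:1")

# DataParallel gathers along dim 0, so all other dimensions must match;
# the 4 vs 3 mismatch in dim 1 raises the same "Gather got an input of invalid size" error.
comm.gather([out_gpu0, out_gpu1], dim=0, destination=0)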

Was a solution to this issue found? I'm receiving the same error. It works with batch size = 1, but I'd like to use a larger batch size if I can.

@isabelcachola for some datasets it works and for some it gives this error. I am getting the same error again now at the last step of the first epoch. Is yours the same?
The problem is due to parallel/distributed multi-GPU training, I guess.
I have two GPUs, but when I run, only one of them gets occupied.

Any thoughts on that?

@isabelcachola one thing I tried which seems to work and didn't throw the error is to set args.n_gpu = 1, so the script doesn't use DataParallel and trains on a single GPU.
But I'm not sure this is the right way of getting around the issue.
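
If the goal is just to avoid the DataParallel wrapping, an equivalent workaround (a sketch, not a fix for the underlying issue) is to expose only one device to the process:

import os

# Restrict the process to a single GPU before torch initializes CUDA,
# so the script never wraps the model in torch.nn.DataParallel.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"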

@isabelcachola this script doesn't save the best model, it saves the last one, right?

@ehsan-soe I fixed the problem by truncating incomplete batches. So if there are 2001 examples and my batch size is 2, I truncate the last example and train on the first 2000. This has fixed it for me both with and without distributed training. My load_and_cache_examples function now looks like this:

def load_and_cache_examples(args, tokenizer, evaluate=False, fpath=None):
    if fpath:
        dataset = TextDataset(tokenizer, args, fpath)
    else:
        dataset = TextDataset(tokenizer, args, args.eval_data_path if evaluate else args.train_data_path)

    # Ignore incomplete batches
    # If you don't do this, you'll get an error at the end of training
    n = len(dataset) % args.per_gpu_train_batch_size
    if n != 0:
        dataset.examples = dataset.examples[:-n]
    return dataset
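
Note that when the model is wrapped in DataParallel, the batch that gets split across GPUs is the total batch, so it may be safer to truncate to a multiple of the effective batch size. A hedged variant of the truncation above (the helper name is just illustrative), assuming args.n_gpu is populated the same way as in run_lm_finetuning.py:

def truncate_to_full_batches(dataset, args):
    # Drop the tail examples so every batch divides evenly across all GPUs.
    # Effective batch = per-GPU batch size * number of GPUs (1 on CPU or a single GPU).
    effective_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
    remainder = len(dataset) % effective_batch_size
    if remainder != 0:
        dataset.examples = dataset.examples[:-remainder]
    return dataset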

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

I am having this same issue trying to train a GPT2LMHead model on 4 Tesla V100s.

@zbloss Look at my answer above and see if that solves your issue

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

"dataloader_drop_last = True " may help?
You can refer to this pr

I think this can solve it.
Duplicate of https://github.com/huggingface/transformers/issues/1220#issuecomment-557237248

Also, you can set the parameter drop_last in your DataLoader like this:
train_dataloader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True, drop_last=True)
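
For the sampler-based setup that run_lm_finetuning.py uses, the same flag fits in like this (a sketch assuming the script's usual argument names):

from torch.utils.data import DataLoader, RandomSampler
from torch.utils.data.distributed import DistributedSampler

# drop_last=True discards the incomplete final batch instead of truncating the dataset.
train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler,
                              batch_size=args.train_batch_size, drop_last=True)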
