I am trying to train RoBERTa using the run_lm_finetuning.py script with TRAIN_FILE=wiki.train.raw and TEST_FILE=wiki.test.raw; basically, I use the demo data (WikiText-2) as described at https://huggingface.co/transformers/examples.html:
CUDA_LAUNCH_BLOCKING=1 python run_lm_finetuning.py \
--output_dir=output \
--model_type=roberta \
--model_name_or_path=roberta-base \
--do_train \
--train_data_file=$TRAIN_FILE \
--do_eval \
--eval_data_file=$TEST_FILE \
--mlm
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [386,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Debugging, I saw that it fails on an embedding lookup whose index exceeds the maximum size, but I am not sure which module needs the fix. Also, I assume this should have run correctly, given that it is the dataset used in the example at https://huggingface.co/transformers/examples.html.
Any help is greatly appreciated.
Thanks.
Hi, could you give us a bit more information? For example, you seem to be running this on a GPU; are you running in a distributed setting? Could you list your software versions (Python, torch, transformers)?
Thank you for your response. I am running on a single machine with one GPU: Python 3.6.8, pytorch_transformers 1.2.0 (from setup.py), torch>=1.0.0 (from requirements.txt), Linux 4.15.0-1044-gcp, NVIDIA-SMI 418.40.04, Driver Version 418.40.04, CUDA Version 10.1. Thank you for your help.
Does the error still happen if you remove CUDA_LAUNCH_BLOCKING=1 ?
Yes, the error still happens:
File "run_lm_finetuning.py", line 472, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_lm_finetuning.py", line 209, in train
outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
....................... (other info omitted) .......
result = self.forward(*input, **kwargs)
output = input.matmul(weight.t())
RuntimeError: cublas runtime error : resource allocation failed at /pytorch/aten/src/THC/THCGeneral.cpp:216
Epoch: 0%| | 0/1 [00:00<?, ?it/s]
Iteration: 0%|
Thank you.
I have also noticed this issue when trying to fine-tune a RoBERTa language model.
Part of the issue appears to be in the calculation of the maximum sequence length in run_lm_finetuning.py:
if args.block_size <= 0:
args.block_size = tokenizer.max_len_single_sentence # Our input block size will be the max possible for the model
This produces a cached file like this: cached_lm_999999999998_wiki.train.raw
Manually checking shows that it is indeed setting the args.block_size parameter to 999999999998
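For reference, a quick way to inspect the values the script falls back on is shown below; this is only a sketch assuming the transformers ~2.x tokenizer attributes (max_len, max_len_single_sentence), and the numbers printed will vary by version.

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
print(tokenizer.max_len)                  # the nominal maximum length for the model
print(tokenizer.max_len_single_sentence)  # max_len minus the special tokens; reported above as 999999999998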
Adding the --block_size 512 argument prevents this, but then leads to an index error similar to the one @estoica111 is experiencing.
Strangely, if I reduce it to --block_size 500, the model trains successfully, but the reported perplexity on the test set seems far too low:
10/18/2019 15:35:44 - INFO - __main__ - Saving features into cached file ~/wikitext-2-raw/cached_lm_500_wiki.test.raw
10/18/2019 15:35:44 - INFO - __main__ - ***** Running evaluation *****
10/18/2019 15:35:44 - INFO - __main__ - Num examples = 572
10/18/2019 15:35:44 - INFO - __main__ - Batch size = 32
Evaluating: 100%|██████████| 18/18 [00:08<00:00, 2.14it/s]
10/18/2019 15:35:53 - INFO - __main__ - ***** Eval results *****
10/18/2019 15:35:53 - INFO - __main__ - perplexity = tensor(1.0631)
Update: I get the exact same perplexity (1.0631) even with the standard pre-trained RoBERTa model on wikitext-2-raw test set. Very confused.
I'm having a hard time replicating this error in transformers 2.1.1. Would it be possible for you to try this on the latest version and let me know your results?
I get a 1.03 perplexity fine-tuning on wiki.train.raw and evaluating on wiki.test.raw, vs 1.45 without fine-tuning.
@LysandreJik, I was on 2.1.1, but just in case I did a full reinstall of the environment from master, and that seems to have fixed the perplexity issue (now getting 1.03 - 1.06 after fine-tuning on wiki.train.raw).
However, the default behavior for block_size still does not work with the provided example. I have to set --block_size 500 or I get the errors I described above; --block_size 512 also still produces an error similar to @estoica111's:
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [171,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Evaluating: 0%| | 0/18 [00:00<?, ?it/s]
Traceback (most recent call last):
File "run_lm_finetuning.py", line 543, in <module>
main()
File "run_lm_finetuning.py", line 535, in main
result = evaluate(args, model, tokenizer, prefix=prefix)
File "run_lm_finetuning.py", line 315, in evaluate
outputs = model(batch, masked_lm_labels=batch) if args.mlm else model(batch, labels=batch)
File "/home/dscripka/software/venv_transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/dscripka/software/transformers/transformers/modeling_roberta.py", line 242, in forward
head_mask=head_mask)
File "/home/dscripka/software/venv_transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/dscripka/software/transformers/transformers/modeling_roberta.py", line 182, in forward
head_mask=head_mask)
File "/home/dscripka/software/transformers/transformers/modeling_bert.py", line 627, in forward
head_mask=head_mask)
File "/home/dscripka/software/venv_transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/dscripka/software/transformers/transformers/modeling_bert.py", line 348, in forward
layer_outputs = layer_module(hidden_states, attention_mask, head_mask[i])
File "/home/dscripka/software/venv_transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/dscripka/software/transformers/transformers/modeling_bert.py", line 326, in forward
attention_outputs = self.attention(hidden_states, attention_mask, head_mask)
File "/home/dscripka/software/venv_transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/dscripka/software/transformers/transformers/modeling_bert.py", line 283, in forward
self_outputs = self.self(input_tensor, attention_mask, head_mask)
File "/home/dscripka/software/venv_transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/dscripka/software/transformers/transformers/modeling_bert.py", line 202, in forward
mixed_query_layer = self.query(hidden_states)
File "/home/dscripka/software/venv_transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/dscripka/software/venv_transformers/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
return F.linear(input, self.weight, self.bias)
File "/home/dscripka/software/venv_transformers/lib/python3.6/site-packages/torch/nn/functional.py", line 1371, in linear
output = input.matmul(weight.t())
RuntimeError: cublas runtime error : resource allocation failed at /pytorch/aten/src/THC/THCGeneral.cpp:216
Software versions:
Python: 3.6.5
Transformers: 2.1.1 (master)
Cuda: 10.0
Torch: 1.2.0
Maybe this is caused by an out-of-range index. I got the same error on CUDA, but trying to compute a single iteration on CPU gives a clearer error description: RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows.
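For anyone who wants to see the readable CPU-side message, a minimal repro sketch is below; it assumes the transformers 2.x API and uses an intentionally over-long random sequence (600 tokens is a made-up length), so the exact index in the message will differ from the one above.

import torch
from transformers import RobertaForMaskedLM

# Running on CPU turns the CUDA assertion into a readable index-out-of-range error.
model = RobertaForMaskedLM.from_pretrained("roberta-base")
too_long = torch.randint(0, model.config.vocab_size, (1, 600), dtype=torch.long)
outputs = model(too_long)  # expected to raise the index error, since 600 tokens exceed the position table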
@LysandreJik I am trying to fine-tune RoBERTa following the examples for run_lm_finetuning.py. The only change I am making is using gradient accumulation of 2 and a per-GPU batch size of 2, as I was running into CUDA memory issues. I am using the raw wiki data from the link provided.
I did a fresh install and have these on aws:
Python: 3.6.5
Transformers: 2.1.1 (master)
Cuda: 10.0
Torch: 1.2.0
1 V100 GPU
After fine-tuning on roberta-large I am getting a perplexity of 2.88 and when I do it on roberta-base I am getting a perplexity of 3.4.
Do you have any ideas on what I might be doing wrong with my setup, or possible solutions?
I encountered essentially the same error when using RoBERTa for SQuAD.
What I found was that the Tokenizer.encode_plus() generates a token_type_ids vector that contains 1s and 0s when two sequences are fed in (question and passage tokens in the case of SQuAD).
The RobertaModel tries to look up these indices in RobertaModel.embeddings.token_type_embeddings. However, the size of the token_type_embeddings is [1,768] and so the error that started this issue arises when it tries to look up the index 1.
I think one solution would be to set token_type_ids to None in the forward method of RobertaModel
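To illustrate what this comment describes, here is a sketch (the [1, 768] shape and the encode_plus behavior are taken from the comment and may differ across transformers versions):

import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

# The token type table reportedly has a single row, i.e. shape [1, 768]
print(model.embeddings.token_type_embeddings.weight.shape)

# With two sequences, encode_plus was observed to emit 1s for the second segment
enc = tokenizer.encode_plus("a question", "a passage")
token_type_ids = torch.tensor([enc.get("token_type_ids", [0] * len(enc["input_ids"]))])

# Looking up index 1 in a one-row table raises the assertion above;
# zeroing the ids (or passing token_type_ids=None) avoids it
safe_token_type_ids = torch.zeros_like(token_type_ids)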
Also having this issue training RoBERTa on MNLI. Similar to @brandenchan's observations, if I set the token_type_ids to all 0, then I don't have a problem, but if I use encode_plus to generate the segment ids, then it triggers that error.
Additionally, it seems like RobertaConfig sets type_vocab_size=2, which seems like it should handle multiple segment ids? But the segment embeddings (currently) only have space for 1.
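A quick way to see the mismatch described here (attribute names assumed from the transformers 2.x API; the concrete values below are the ones reported in this thread and may differ in your install):

from transformers import RobertaConfig, RobertaModel

model = RobertaModel.from_pretrained("roberta-base")
print(RobertaConfig().type_vocab_size)                        # the class default mentioned above
print(model.config.type_vocab_size)                           # the value loaded with the pretrained weights
print(model.embeddings.token_type_embeddings.num_embeddings)  # the segment table size, reported above as 1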
Regarding the "RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows" comment above: this is pretty weird, as I was getting the same error when running bert_lm_finetuning. I guess it's because the sentence length is greater than 512, but since the script's TextDataset truncates via the block_size parameter, this isn't supposed to happen... I set block_size < 511 (510, 500, ...) and the error is gone.
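One plausible explanation (my own back-of-the-envelope reasoning, resting on two assumptions: the dataset adds the two special tokens on top of each block_size chunk, and RoBERTa's position ids start at an offset of 2 against a 514-row table, as roberta-base's config suggests):

# Why block_size 512 can overflow while 510 fits (assumed numbers, see above).
def max_position_id(block_size, special_tokens=2, first_position=2):
    effective_len = block_size + special_tokens
    return first_position + effective_len - 1

print(max_position_id(512))  # 515 -> exceeds indices 0..513 of a 514-row table
print(max_position_id(510))  # 513 -> still fits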
From what I read in this thread, it seems the cause of the issue @shreydesai points to is the absence of pre-trained token_type embeddings beyond a single [1, 768] parameter (which explains why passing 0 doesn't trigger the index out of range). The thread above offers a hack to get around this (i.e. modifying this parameter ad hoc) if multi-segment inputs are a must (which _is_ the case in my task).
To make this more useful, here is the hack snippet (credit: Colanim):
import torch
from transformers import RobertaModel

model = RobertaModel.from_pretrained('roberta-base')
model.config.type_vocab_size = 2
single_emb = model.embeddings.token_type_embeddings
# Replace the single-row token type table with a 2-row one, copying the
# pre-trained row into both slots so segment 1 starts from the same weights.
model.embeddings.token_type_embeddings = torch.nn.Embedding(2, single_emb.embedding_dim)
model.embeddings.token_type_embeddings.weight = torch.nn.Parameter(single_emb.weight.repeat([2, 1]))
If a headed model wrapper is used (e.g. RobertaForSequenceClassification), add .roberta after model to modify the RobertaModel object in the wrapper.
Having experimented in my classifier, I can offer one data point that it doesn't break anything and works as intended.
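For completeness, the same patch applied through a headed wrapper would look roughly like this (a sketch under the same assumptions as the snippet above):

import torch
from transformers import RobertaForSequenceClassification

clf = RobertaForSequenceClassification.from_pretrained('roberta-base')
clf.config.type_vocab_size = 2
single_emb = clf.roberta.embeddings.token_type_embeddings
clf.roberta.embeddings.token_type_embeddings = torch.nn.Embedding(2, single_emb.embedding_dim)
clf.roberta.embeddings.token_type_embeddings.weight = torch.nn.Parameter(single_emb.weight.repeat([2, 1]))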
In my case (using version 2.3), the hard-coded padding_idx caused the problem.
If position_ids is None and seq_length is 512, the maximum value of the generated position_ids exceeds 511, which is the largest index the embedding matrix can use.
The code in the latest version is different from the one above, but setting position_ids manually fixed the problem for me.
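A rough sketch of the manual workaround described here (my own reconstruction, not the commenter's code; the 0-based ids and the call site are assumptions):

import torch

seq_length = 512
# Build the position ids explicitly (0 .. seq_length - 1) so the maximum index
# stays at 511, within the table mentioned above, instead of letting the model
# derive them with the padding_idx offset.
position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0)
# outputs = model(input_ids, position_ids=position_ids)  # hypothetical call site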