Transformers: errors encountered with run_lm_finetuning.py

Created on 1 Jan 2020 · 13 Comments · Source: huggingface/transformers

Hi,
I am using run_lm_finetuning.py and ran into the following issues:

The default value of block_size is -1, which causes the error below. Setting the default value to 512 solves it:

  File "run_lm_finetuning.py", line 712, in <module>
    main()
  File "run_lm_finetuning.py", line 662, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_lm_finetuning.py", line 198, in train
    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
  File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer36/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 64, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integeral value, but got num_samples=0
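
A minimal sketch of how a bad block size can end in num_samples=0, assuming the script builds its dataset by cutting the tokenized file into consecutive fixed-size blocks and dropping the remainder. The helper name split_into_blocks is hypothetical and is not the script's own function.

```python
def split_into_blocks(token_ids, block_size):
    """Cut token_ids into consecutive blocks of exactly block_size tokens.

    The trailing partial block is dropped (an assumption about the script).
    """
    if block_size <= 0:
        raise ValueError(f"block_size must be positive, got {block_size}")
    last_start = len(token_ids) - block_size
    return [token_ids[i:i + block_size] for i in range(0, last_start + 1, block_size)]


token_ids = list(range(1200))  # stand-in for a tokenized training file

print(len(split_into_blocks(token_ids, 512)))       # 2 full blocks
print(len(split_into_blocks(token_ids, 10 ** 12)))  # 0: no block fits
```

An empty list of blocks is an empty dataset, which is what RandomSampler rejects in the traceback above.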

global_step = int(args.model_name_or_path.split("-")[-1].split("/")[0]) can crash. For example, with args.model_name_or_path=gpt2 the expression reduces to int("gpt2"), which raises a ValueError. Maybe fall back to global_step = 0 in that case?
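
A minimal sketch of the suggested fallback, assuming a path that does not end in a checkpoint number should simply start from step 0. The function name parse_global_step is hypothetical; only the split expression comes from the script.

```python
def parse_global_step(model_name_or_path):
    """Return the trailing checkpoint number of a path, or 0 if there is none."""
    suffix = model_name_or_path.split("-")[-1].split("/")[0]
    try:
        return int(suffix)
    except ValueError:
        # Not a "checkpoint-<step>" style path: start counting from step 0.
        return 0


print(parse_global_step("output/checkpoint-500"))  # 500
print(parse_global_step("gpt2"))                   # 0
```
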
When running the script for a BERT model (I am using PyTorch 1.2), I also got the following error:

(transformer) rkarimi@italix17:/idiap/user/rkarimi/dev/lm_heads$ python run_lm_finetuning.py  --output_dir=/idiap/temp/rkarimi/lm_heads/distilbert  --model_type=distilbert --model_name_or_path=/idiap/temp/rkarimi/pretrained_transformers/bert_distil  --do_train  --train_data_file=/idiap/temp/rkarimi/resources/wikitext-2-raw/wiki.train.raw  --do_eval  --eval_data_file=/idiap/temp/rkarimi/resources/wikitext-2-raw/wiki.test.raw --mlm --block_size=511
To use data.metrics please install scikit-learn. See https://scikit-learn.org/stable/index.html
01/02/2020 16:53:27 - WARNING - __main__ -   Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
01/02/2020 16:53:27 - INFO - transformers.configuration_utils -   loading configuration file /idiap/temp/rkarimi/pretrained_transformers/bert_distil/config.json
01/02/2020 16:53:27 - INFO - transformers.configuration_utils -   Model config {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": null,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "max_position_embeddings": 512,
  "n_heads": 12,
  "n_layers": 6,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pruned_heads": {},
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torchscript": false,
  "use_bfloat16": false,
  "vocab_size": 30522
}

01/02/2020 16:53:27 - INFO - transformers.tokenization_utils -   Model name '/idiap/temp/rkarimi/pretrained_transformers/bert_distil' not found in model shortcut name list (distilbert-base-uncased, distilbert-base-uncased-distilled-squad, distilbert-base-german-cased, distilbert-base-multilingual-cased). Assuming '/idiap/temp/rkarimi/pretrained_transformers/bert_distil' is a path or url to a directory containing tokenizer files.
01/02/2020 16:53:27 - INFO - transformers.tokenization_utils -   Didn't find file /idiap/temp/rkarimi/pretrained_transformers/bert_distil/added_tokens.json. We won't load it.
01/02/2020 16:53:27 - INFO - transformers.tokenization_utils -   Didn't find file /idiap/temp/rkarimi/pretrained_transformers/bert_distil/special_tokens_map.json. We won't load it.
01/02/2020 16:53:27 - INFO - transformers.tokenization_utils -   Didn't find file /idiap/temp/rkarimi/pretrained_transformers/bert_distil/tokenizer_config.json. We won't load it.
01/02/2020 16:53:27 - INFO - transformers.tokenization_utils -   loading file /idiap/temp/rkarimi/pretrained_transformers/bert_distil/vocab.txt
01/02/2020 16:53:27 - INFO - transformers.tokenization_utils -   loading file None
01/02/2020 16:53:27 - INFO - transformers.tokenization_utils -   loading file None
01/02/2020 16:53:27 - INFO - transformers.tokenization_utils -   loading file None
01/02/2020 16:53:27 - INFO - transformers.modeling_utils -   loading weights file /idiap/temp/rkarimi/pretrained_transformers/bert_distil/pytorch_model.bin
01/02/2020 16:53:28 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, block_size=511, cache_dir='', config_name='', device=device(type='cpu'), do_eval=True, do_lower_case=False, do_train=True, eval_all_checkpoints=False, eval_data_file='/idiap/temp/rkarimi/resources/wikitext-2-raw/wiki.test.raw', evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=5e-05, local_rank=-1, logging_steps=50, max_grad_norm=1.0, max_steps=-1, mlm=True, mlm_probability=0.15, model_name_or_path='/idiap/temp/rkarimi/pretrained_transformers/bert_distil', model_type='distilbert', n_gpu=0, no_cuda=False, num_train_epochs=1.0, output_dir='/idiap/temp/rkarimi/lm_heads/distilbert', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=4, save_steps=50, save_total_limit=None, seed=42, server_ip='', server_port='', tokenizer_name='', train_data_file='/idiap/temp/rkarimi/resources/wikitext-2-raw/wiki.train.raw', warmup_steps=0, weight_decay=0.0)
01/02/2020 16:53:28 - INFO - __main__ -   Creating features from dataset file at /idiap/temp/rkarimi/resources/wikitext-2-raw
01/02/2020 16:53:53 - INFO - __main__ -   Saving features into cached file /idiap/temp/rkarimi/pretrained_transformers/bert_distil_cached_lm_511_wiki.train.raw
01/02/2020 16:53:53 - INFO - __main__ -   ***** Running training *****
01/02/2020 16:53:53 - INFO - __main__ -     Num examples = 4303
01/02/2020 16:53:53 - INFO - __main__ -     Num Epochs = 1
01/02/2020 16:53:53 - INFO - __main__ -     Instantaneous batch size per GPU = 4
01/02/2020 16:53:53 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 4
01/02/2020 16:53:53 - INFO - __main__ -     Gradient Accumulation steps = 1
01/02/2020 16:53:53 - INFO - __main__ -     Total optimization steps = 1076
01/02/2020 16:53:53 - INFO - __main__ -     Continuing training from checkpoint, will skip to saved global_step
01/02/2020 16:53:53 - INFO - __main__ -     Continuing training from epoch 0
01/02/2020 16:53:53 - INFO - __main__ -     Continuing training from global step 0
01/02/2020 16:53:53 - INFO - __main__ -     Will skip the first 0 steps in the first epoch
Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_lm_finetuning.py", line 738, in <module>
    main()
  File "run_lm_finetuning.py", line 688, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_lm_finetuning.py", line 325, in train
    outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
  File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/transformers/modeling_distilbert.py", line 540, in forward
    inputs_embeds=inputs_embeds)
  File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/transformers/modeling_distilbert.py", line 477, in forward
    inputs_embeds = self.embeddings(input_ids)   # (bs, seq_length, dim)
  File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/transformers/modeling_distilbert.py", line 96, in forward
    position_embeddings = self.position_embeddings(position_ids)        # (bs, max_seq_length, dim)
  File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/torch/nn/functional.py", line 1467, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. at /opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/TH/generic/THTensorEvenMoreMath.cpp:237

This issue goes away with a smaller block size (block_size <= 510). It would be very nice to document in the code that block_size must be <= 510 as a temporary workaround. Thanks.
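
One way to read the 510 limit, as a sketch: assuming each block is wrapped in two special tokens ([CLS] and [SEP]) before it reaches the model, a model with 512 position embeddings has room for only 510 text tokens. The helper names below are hypothetical, and the count of two special tokens is an assumption about single-sequence BERT-style inputs.

```python
def max_block_size(max_position_embeddings, num_special_tokens=2):
    """Largest block of text tokens that still leaves room for special tokens."""
    return max_position_embeddings - num_special_tokens


def check_block_size(block_size, max_position_embeddings, num_special_tokens=2):
    """Raise early instead of failing inside the position embedding lookup."""
    limit = max_block_size(max_position_embeddings, num_special_tokens)
    if block_size > limit:
        raise ValueError(
            f"block_size={block_size} plus {num_special_tokens} special tokens "
            f"exceeds {max_position_embeddings} positions; use at most {limit}"
        )
    return block_size


print(max_block_size(512))  # 510
```

With block_size=511 the sequence would hold 513 tokens, so the last position index is 512, which matches the index reported in the RuntimeError above.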

In the mask_tokens function, the following line needs to use -1 instead of -100, because -1 is the ignore_index used in the "BertForMaskedLM" model:
labels[~masked_indices] = -100 => -1

thanks.

wontfix

rabeehk

👍7

Most helpful comment

Hi, I just encountered the same error finetuning a custom gpt-2 model with run_language_modeling.py on Colab.

Traceback (most recent call last):
  File "run_language_modeling.py", line 799, in <module>
    main()
  File "run_language_modeling.py", line 749, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_language_modeling.py", line 245, in train
    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/sampler.py", line 94, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0

I solved it by specifying --block_size, as @rabeehk said.
It might be worth mentioning that in your docs, or having a default setup that works out of the box? I also had to dig into the code to find the --should_continue and --overwrite_output_dir flags to continue training. Is there a page where that is discussed, by any chance?

As an aside, I can't seem to find a flag to print the loss during training. I see there is a log/save step every 500 iterations, but it doesn't give any of these stats. Is there something super obvious I am missing?

Thanks in any case!

jchwenger on 8 Mar 2020

👍4

All 13 comments

Hello, I also got the same error while running BERT.

Traceback (most recent call last):
  File "code/transformers-2.3.0/examples/run_lm_finetuning.py", line 713, in <module>
    main()
  File "code/transformers-2.3.0/examples/run_lm_finetuning.py", line 663, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "code/transformers-2.3.0/examples/run_lm_finetuning.py", line 268, in train
    global_step = int(args.model_name_or_path.split("-")[-1].split("/")[0])
ValueError: invalid literal for int() with base 10: 'pytorch'

Could anyone help?

calusbr on 2 Jan 2020

@calusbr
Hi, for the error you reported if you set global_step = 0 it should work.

rabeehk on 2 Jan 2020

Hi, thank you for raising this issue. Could you please let me know if 27c1b656cca75efa0cc414d3bf4e6aacf24829de fixed this issue by trying the updated script?

LysandreJik on 7 Jan 2020

Hello, to solve this problem I put my checkpoint into a folder named the way the Transformers scripts name their output checkpoints.

new folder -> chekpoint-0

Folders:
|
chekpoint-0
| vocab.txt
| pytorch_model.bin
| config.json

global_step = int(args.model_name_or_path.split("-")[-1].split("/")[0])

Result:
global_step = 0

calusbr on 8 Jan 2020

@rabeehk Hello! I am also facing "ValueError: num_samples should be a positive integeral value, but got num_samples=0". Have you fixed this problem? Thank you~

JiangYanting on 9 Jan 2020

👍1

@LysandreJik I tried it on 2020-1-9. It seems that the problem "ValueError: num_samples should be a positive integeral value, but got num_samples=0" still exists...

JiangYanting on 9 Jan 2020

👍1

Hi,
I tested it and it does fix the first issue, thanks. But as I wrote in my
first message, there are a couple more errors. Currently
I get this error:

(transformer) rkarimi@vgnc002:/idiap/user/rkarimi/dev/lm_heads$ python
run_lm_original.py --output_dir=/idiap/temp/rkarimi/lm_heads/bert_original
--model_type=bert
--model_name_or_path=/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/
--do_train
--train_data_file=/idiap/temp/rkarimi/resources/wikitext-2-raw/wiki.train.raw
--do_eval
--eval_data_file=/idiap/temp/rkarimi/resources/wikitext-2-raw/wiki.test.raw
--mlm --block_size 510 --overwrite_output_dir --num_train_epochs 3
--evaluate_during_training
01/09/2020 09:37:59 - WARNING - __main__ - Process rank: -1, device:
cuda, n_gpu: 1, distributed training: False, 16-bits training: False
01/09/2020 09:37:59 - INFO - transformers.configuration_utils - loading
configuration file
/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/config.json
01/09/2020 09:37:59 - INFO - transformers.configuration_utils - Model
config {
"attention_probs_dropout_prob": 0.1,
"finetuning_task": null,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"num_labels": 2,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pruned_heads": {},
"torchscript": false,
"type_vocab_size": 2,
"use_bfloat16": false,
"vocab_size": 30522
}

01/09/2020 09:37:59 - INFO - transformers.tokenization_utils - Model name
'/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/' not found
in model shortcut name list (bert-base-uncased, bert-large-uncased,
bert-base-cased, bert-large-cased, bert-base-multilingual-uncased,
bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased,
bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking,
bert-large-uncased-whole-word-masking-finetuned-squad,
bert-large-cased-whole-word-masking-finetuned-squad,
bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased,
bert-base-german-dbmdz-uncased, bert-base-finnish-cased-v1,
bert-base-finnish-uncased-v1). Assuming
'/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/' is a path
or url to a directory containing tokenizer files.
01/09/2020 09:37:59 - INFO - transformers.tokenization_utils - Didn't
find file
/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/added_tokens.json.
We won't load it.
01/09/2020 09:37:59 - INFO - transformers.tokenization_utils - Didn't
find file
/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/special_tokens_map.json.
We won't load it.
01/09/2020 09:37:59 - INFO - transformers.tokenization_utils - Didn't
find file
/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/tokenizer_config.json.
We won't load it.
01/09/2020 09:37:59 - INFO - transformers.tokenization_utils - loading
file /idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/vocab.txt
01/09/2020 09:37:59 - INFO - transformers.tokenization_utils - loading
file None
01/09/2020 09:37:59 - INFO - transformers.tokenization_utils - loading
file None
01/09/2020 09:37:59 - INFO - transformers.tokenization_utils - loading
file None
01/09/2020 09:37:59 - INFO - transformers.modeling_utils - loading
weights file
/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/pytorch_model.bin
01/09/2020 09:38:04 - INFO - transformers.modeling_utils - Weights from
pretrained model not used in BertForMaskedLM:
['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
01/09/2020 09:38:09 - INFO - __main__ - Training/evaluation parameters
Namespace(adam_epsilon=1e-08, block_size=510, cache_dir='', config_name='',
device=device(type='cuda'), do_eval=True, do_lower_case=False,
do_train=True, eval_all_checkpoints=False,
eval_data_file='/idiap/temp/rkarimi/resources/wikitext-2-raw/wiki.test.raw',
evaluate_during_training=True, fp16=False, fp16_opt_level='O1',
gradient_accumulation_steps=1, learning_rate=5e-05, local_rank=-1,
logging_steps=50, max_grad_norm=1.0, max_steps=-1, mlm=True,
mlm_probability=0.15,
model_name_or_path='/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/',
model_type='bert', n_gpu=1, no_cuda=False, num_train_epochs=3.0,
output_dir='/idiap/temp/rkarimi/lm_heads/bert_original',
overwrite_cache=False, overwrite_output_dir=True,
per_gpu_eval_batch_size=4, per_gpu_train_batch_size=4, save_steps=50,
save_total_limit=None, seed=42, server_ip='', server_port='',
tokenizer_name='',
train_data_file='/idiap/temp/rkarimi/resources/wikitext-2-raw/wiki.train.raw',
warmup_steps=0, weight_decay=0.0)
01/09/2020 09:38:09 - INFO - __main__ - Loading features from cached file
/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/_cached_lm_510_wiki.train.raw
01/09/2020 09:38:09 - INFO - __main__ - Running training
01/09/2020 09:38:09 - INFO - __main__ - Num examples = 4312
01/09/2020 09:38:09 - INFO - __main__ - Num Epochs = 3
01/09/2020 09:38:09 - INFO - __main__ - Instantaneous batch size per
GPU = 4
01/09/2020 09:38:09 - INFO - __main__ - Total train batch size (w.
parallel, distributed & accumulation) = 4
01/09/2020 09:38:09 - INFO - __main__ - Gradient Accumulation steps = 1
01/09/2020 09:38:09 - INFO - __main__ - Total optimization steps = 3234
01/09/2020 09:38:09 - INFO - __main__ - Starting fine-tuning.
Epoch:   0%|          | 0/3 [00:00<?, ?it/s]
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [2,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [3,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [4,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [5,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [7,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [8,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [9,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [10,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [11,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [12,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [13,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [14,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [15,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [16,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [17,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [19,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [20,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [21,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [22,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [24,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [25,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [26,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [27,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [28,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [29,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [30,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [31,0,0] Assertion t >= 0 && t < n_classes failed.
Traceback (most recent call last):
  File "run_lm_original.py", line 717, in <module>
    main()
  File "run_lm_original.py", line 667, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_lm_original.py", line 316, in train
    loss.backward()
  File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: merge_sort: failed to synchronize: device-side assert triggered
Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Iteration: 0%|

Best
Rabeeh

rabeehk on 9 Jan 2020

@rabeehk, concerning your first issue:

block_size value is by default = -1, which creates the following error, can be solved by setting the default value to 512

The very first use of args.block_size checks whether it is negative (e.g. -1) and, if so, sets it to the maximum model length. Is this not working in your case?
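
The check described here can be sketched as follows. This is an illustration under assumptions, not the script's code: resolve_block_size and tokenizer_max_len are hypothetical names, and clamping a positive value to the tokenizer limit is assumed.

```python
def resolve_block_size(block_size, tokenizer_max_len):
    """Pick the effective block size from the CLI value and the tokenizer limit."""
    if block_size <= 0:
        # A non-positive value (such as the default -1) means "use the maximum".
        return tokenizer_max_len
    # Never exceed what the tokenizer/model says it can handle.
    return min(block_size, tokenizer_max_len)


print(resolve_block_size(-1, 512))    # 512
print(resolve_block_size(128, 512))   # 128
print(resolve_block_size(1024, 512))  # 512
```

If the tokenizer were to report a very large maximum length (a guess at what could happen with a custom tokenizer that has no configured limit), the resolved block size could exceed the number of tokens in the training file and leave the dataset empty.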

The issue will resolve by setting smaller block_size <= 510, it would be very nice to document this in the codes that one needs to set the block_size <= 510 as a temporary solution. thanks

This should be solved by the previously mentioned lines as well.

In mask_tokens function, the following lines needs to be set to -1 not -100 which is the ignore_index used in the "BertForMaskedLM" model:
labels[~masked_indices] = -100 => -1

This is not the case anymore, as you can see in the BertForMaskedLM source code. The examples are maintained to work with the current master branch, and not a specific release. If you want to run scripts with a specific version, you can get them from a specific version tag on GitHub, e.g. version 2.3.0.
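
The effect of such a version mismatch can be illustrated without PyTorch. This is a minimal sketch, assuming the loss skips positions whose label equals ignore_index and requires every other label to be a valid class index; masked_lm_loss is a hypothetical helper, not the library's implementation.

```python
import math


def masked_lm_loss(logits, labels, ignore_index):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # unmasked position: contributes nothing to the loss
        if not 0 <= label < len(row):
            raise ValueError(f"label {label} is outside [0, {len(row)})")
        log_norm = math.log(sum(math.exp(x) for x in row))
        total += log_norm - row[label]
        count += 1
    return total / count if count else 0.0


logits = [[0.0, 0.0], [0.0, 0.0]]  # two positions, two classes
labels = [1, -100]                 # second position marked as "ignore"

print(round(masked_lm_loss(logits, labels, ignore_index=-100), 4))  # 0.6931
```

Calling the same function with ignore_index=-1 raises a ValueError for the label -100, which is the pure-Python analogue of an out-of-range class assertion when the script and the installed model disagree on the ignore value.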

Please let me know if you can see why the block size doesn't seem to be set to the maximum value, I'll fix it if it is an issue with the script. Thank you @rabeehk!

LysandreJik on 9 Jan 2020

@rabeehk Hi! May I ask whether you fixed the problem "ValueError: num_samples should be a positive integeral value, but got num_samples=0" by setting "global_step = 0"? Like this:

    try:
        # set global_step to gobal_step of last saved checkpoint from model path
        checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
        global_step = int(checkpoint_suffix)
        epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
        steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)

Should I change "global_step = int(checkpoint_suffix)" to "global_step = 0"? Thanks!

JiangYanting on 9 Jan 2020

Hi
No. You need to set --block_size to a positive number; try 510, maybe.
Best
Rabeeh

rabeehk on 9 Jan 2020

👍1

Changing from 512 to 510 worked for me. I would think that we should be able to use 512, the max size for BERT input? Or is there something I'm overlooking?

Santosh-Gupta on 27 Jan 2020

👍3

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.