Hi,
I am using run_lm_finetuning.py and encountered the following issues:
File "run_lm_finetuning.py", line 712, in <module>
main()
File "run_lm_finetuning.py", line 662, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_lm_finetuning.py", line 198, in train
train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer36/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 64, in __init__
"value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integeral value, but got num_samples=0
global_step = int(args.model_name_or_path.split("-")[-1].split("/")[0]) can crash. For example, if args.model_name_or_path=gpt2, the expression evaluates to int("gpt2"), which raises a ValueError. Maybe global_step should be set to 0 in that case?
When running the script for a BERT model I also got the following error. I am using PyTorch 1.2.
(transformer) rkarimi@italix17:/idiap/user/rkarimi/dev/lm_heads$ python run_lm_finetuning.py --output_dir=/idiap/temp/rkarimi/lm_heads/distilbert --model_type=distilbert --model_name_or_path=/idiap/temp/rkarimi/pretrained_transformers/bert_distil --do_train --train_data_file=/idiap/temp/rkarimi/resources/wikitext-2-raw/wiki.train.raw --do_eval --eval_data_file=/idiap/temp/rkarimi/resources/wikitext-2-raw/wiki.test.raw --mlm --block_size=511
To use data.metrics please install scikit-learn. See https://scikit-learn.org/stable/index.html
01/02/2020 16:53:27 - WARNING - __main__ - Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
01/02/2020 16:53:27 - INFO - transformers.configuration_utils - loading configuration file /idiap/temp/rkarimi/pretrained_transformers/bert_distil/config.json
01/02/2020 16:53:27 - INFO - transformers.configuration_utils - Model config {
"activation": "gelu",
"attention_dropout": 0.1,
"dim": 768,
"dropout": 0.1,
"finetuning_task": null,
"hidden_dim": 3072,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"is_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"max_position_embeddings": 512,
"n_heads": 12,
"n_layers": 6,
"num_labels": 2,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pruned_heads": {},
"qa_dropout": 0.1,
"seq_classif_dropout": 0.2,
"sinusoidal_pos_embds": false,
"tie_weights_": true,
"torchscript": false,
"use_bfloat16": false,
"vocab_size": 30522
}
01/02/2020 16:53:27 - INFO - transformers.tokenization_utils - Model name '/idiap/temp/rkarimi/pretrained_transformers/bert_distil' not found in model shortcut name list (distilbert-base-uncased, distilbert-base-uncased-distilled-squad, distilbert-base-german-cased, distilbert-base-multilingual-cased). Assuming '/idiap/temp/rkarimi/pretrained_transformers/bert_distil' is a path or url to a directory containing tokenizer files.
01/02/2020 16:53:27 - INFO - transformers.tokenization_utils - Didn't find file /idiap/temp/rkarimi/pretrained_transformers/bert_distil/added_tokens.json. We won't load it.
01/02/2020 16:53:27 - INFO - transformers.tokenization_utils - Didn't find file /idiap/temp/rkarimi/pretrained_transformers/bert_distil/special_tokens_map.json. We won't load it.
01/02/2020 16:53:27 - INFO - transformers.tokenization_utils - Didn't find file /idiap/temp/rkarimi/pretrained_transformers/bert_distil/tokenizer_config.json. We won't load it.
01/02/2020 16:53:27 - INFO - transformers.tokenization_utils - loading file /idiap/temp/rkarimi/pretrained_transformers/bert_distil/vocab.txt
01/02/2020 16:53:27 - INFO - transformers.tokenization_utils - loading file None
01/02/2020 16:53:27 - INFO - transformers.tokenization_utils - loading file None
01/02/2020 16:53:27 - INFO - transformers.tokenization_utils - loading file None
01/02/2020 16:53:27 - INFO - transformers.modeling_utils - loading weights file /idiap/temp/rkarimi/pretrained_transformers/bert_distil/pytorch_model.bin
01/02/2020 16:53:28 - INFO - __main__ - Training/evaluation parameters Namespace(adam_epsilon=1e-08, block_size=511, cache_dir='', config_name='', device=device(type='cpu'), do_eval=True, do_lower_case=False, do_train=True, eval_all_checkpoints=False, eval_data_file='/idiap/temp/rkarimi/resources/wikitext-2-raw/wiki.test.raw', evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=5e-05, local_rank=-1, logging_steps=50, max_grad_norm=1.0, max_steps=-1, mlm=True, mlm_probability=0.15, model_name_or_path='/idiap/temp/rkarimi/pretrained_transformers/bert_distil', model_type='distilbert', n_gpu=0, no_cuda=False, num_train_epochs=1.0, output_dir='/idiap/temp/rkarimi/lm_heads/distilbert', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=4, save_steps=50, save_total_limit=None, seed=42, server_ip='', server_port='', tokenizer_name='', train_data_file='/idiap/temp/rkarimi/resources/wikitext-2-raw/wiki.train.raw', warmup_steps=0, weight_decay=0.0)
01/02/2020 16:53:28 - INFO - __main__ - Creating features from dataset file at /idiap/temp/rkarimi/resources/wikitext-2-raw
01/02/2020 16:53:53 - INFO - __main__ - Saving features into cached file /idiap/temp/rkarimi/pretrained_transformers/bert_distil_cached_lm_511_wiki.train.raw
01/02/2020 16:53:53 - INFO - __main__ - ***** Running training *****
01/02/2020 16:53:53 - INFO - __main__ - Num examples = 4303
01/02/2020 16:53:53 - INFO - __main__ - Num Epochs = 1
01/02/2020 16:53:53 - INFO - __main__ - Instantaneous batch size per GPU = 4
01/02/2020 16:53:53 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
01/02/2020 16:53:53 - INFO - __main__ - Gradient Accumulation steps = 1
01/02/2020 16:53:53 - INFO - __main__ - Total optimization steps = 1076
01/02/2020 16:53:53 - INFO - __main__ - Continuing training from checkpoint, will skip to saved global_step
01/02/2020 16:53:53 - INFO - __main__ - Continuing training from epoch 0
01/02/2020 16:53:53 - INFO - __main__ - Continuing training from global step 0
01/02/2020 16:53:53 - INFO - __main__ - Will skip the first 0 steps in the first epoch
Epoch: 0%| | 0/1 [00:00<?, ?it/sTraceback (most recent call last): | 0/1076 [00:00<?, ?it/s]
File "run_lm_finetuning.py", line 738, in <module>
main()
File "run_lm_finetuning.py", line 688, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_lm_finetuning.py", line 325, in train
outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/transformers/modeling_distilbert.py", line 540, in forward
inputs_embeds=inputs_embeds)
File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/transformers/modeling_distilbert.py", line 477, in forward
inputs_embeds = self.embeddings(input_ids) # (bs, seq_length, dim)
File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/transformers/modeling_distilbert.py", line 96, in forward
position_embeddings = self.position_embeddings(position_ids) # (bs, max_seq_length, dim)
File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/torch/nn/functional.py", line 1467, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. at /opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/TH/generic/THTensorEvenMoreMath.cpp:237
The issue is resolved by setting a smaller block_size (<= 510). It would be very nice to document in the code that block_size needs to be <= 510 as a temporary solution. Thanks.
Hello, I also got the same error while running BERT.
Traceback (most recent call last):
File "code/transformers-2.3.0/examples/run_lm_finetuning.py", line 713, in <module>
main()
File "code/transformers-2.3.0/examples/run_lm_finetuning.py", line 663, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "code/transformers-2.3.0/examples/run_lm_finetuning.py", line 268, in train
global_step = int(args.model_name_or_path.split("-")[-1].split("/")[0])
ValueError: invalid literal for int() with base 10: 'pytorch'
Could anyone help?
@calusbr
Hi, for the error you reported if you set global_step = 0 it should work.
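A minimal sketch of that workaround, assuming the script only needs a resume step when the path ends in a `checkpoint-<N>` style suffix. The helper name `parse_global_step` is hypothetical and is not part of run_lm_finetuning.py:

```python
def parse_global_step(model_name_or_path: str) -> int:
    """Return the step encoded in a '...-<N>' path suffix, or 0 if there is none.

    Hypothetical helper for illustration; not part of the example script.
    """
    # Same expression as in the traceback above.
    suffix = model_name_or_path.split("-")[-1].split("/")[0]
    try:
        return int(suffix)
    except ValueError:
        # Not a checkpoint directory (e.g. "gpt2"): start from step 0.
        return 0


print(parse_global_step("gpt2"))                   # 0
print(parse_global_step("output/checkpoint-500"))  # 500
```

With this fallback, a plain model name or a local directory without a numeric suffix starts training from step 0 instead of raising a ValueError.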
Hi, thank you for raising this issue. Could you please let me know if 27c1b656cca75efa0cc414d3bf4e6aacf24829de fixed this issue by trying the updated script?
Hello, to solve this problem I put my checkpoint in a folder named like the Transformers output folders.
New folder -> checkpoint-0
Folder contents:
checkpoint-0
| vocab.txt
| pytorch_model.bin
| config.json
global_step = int(args.model_name_or_path.split("-")[-1].split("/")[0])
Result:
global_step = 0
@rabeehk Hello! I am also facing "ValueError: num_samples should be a positive integeral value, but got num_samples=0". Have you fixed this problem? Thank you!
@LysandreJik I tried it on 2020-01-09. It seems that the problem "ValueError: num_samples should be a positive integeral value, but got num_samples=0" still exists.
Hi,
I tested it and it does fix the first issue, thanks. But as I wrote in the
first email, there are a couple more errors. Currently
I get this error:
(transformer) rkarimi@vgnc002:/idiap/user/rkarimi/dev/lm_heads$ python
run_lm_original.py --output_dir=/idiap/temp/rkarimi/lm_heads/bert_original
--model_type=bert
--model_name_or_path=/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/
--do_train
--train_data_file=/idiap/temp/rkarimi/resources/wikitext-2-raw/wiki.train.raw
--do_eval
--eval_data_file=/idiap/temp/rkarimi/resources/wikitext-2-raw/wiki.test.raw
--mlm --block_size 510 --overwrite_output_dir --num_train_epochs 3
--evaluate_during_training
01/09/2020 09:37:59 - WARNING - __main__ - Process rank: -1, device:
cuda, n_gpu: 1, distributed training: False, 16-bits training: False
01/09/2020 09:37:59 - INFO - transformers.configuration_utils - loading
configuration file
/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/config.json
01/09/2020 09:37:59 - INFO - transformers.configuration_utils - Model
config {
"attention_probs_dropout_prob": 0.1,
"finetuning_task": null,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"num_labels": 2,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pruned_heads": {},
"torchscript": false,
"type_vocab_size": 2,
"use_bfloat16": false,
"vocab_size": 30522
}
01/09/2020 09:37:59 - INFO - transformers.tokenization_utils - Model name
'/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/' not found
in model shortcut name list (bert-base-uncased, bert-large-uncased,
bert-base-cased, bert-large-cased, bert-base-multilingual-uncased,
bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased,
bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking,
bert-large-uncased-whole-word-masking-finetuned-squad,
bert-large-cased-whole-word-masking-finetuned-squad,
bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased,
bert-base-german-dbmdz-uncased, bert-base-finnish-cased-v1,
bert-base-finnish-uncased-v1). Assuming
'/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/' is a path
or url to a directory containing tokenizer files.
01/09/2020 09:37:59 - INFO - transformers.tokenization_utils - Didn't
find file
/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/added_tokens.json.
We won't load it.
01/09/2020 09:37:59 - INFO - transformers.tokenization_utils - Didn't
find file
/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/special_tokens_map.json.
We won't load it.
01/09/2020 09:37:59 - INFO - transformers.tokenization_utils - Didn't
find file
/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/tokenizer_config.json.
We won't load it.
01/09/2020 09:37:59 - INFO - transformers.tokenization_utils - loading
file /idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/vocab.txt
01/09/2020 09:37:59 - INFO - transformers.tokenization_utils - loading
file None
01/09/2020 09:37:59 - INFO - transformers.tokenization_utils - loading
file None
01/09/2020 09:37:59 - INFO - transformers.tokenization_utils - loading
file None
01/09/2020 09:37:59 - INFO - transformers.modeling_utils - loading
weights file
/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/pytorch_model.bin
01/09/2020 09:38:04 - INFO - transformers.modeling_utils - Weights from
pretrained model not used in BertForMaskedLM:
['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
01/09/2020 09:38:09 - INFO - __main__ - Training/evaluation parameters
Namespace(adam_epsilon=1e-08, block_size=510, cache_dir='', config_name='',
device=device(type='cuda'), do_eval=True, do_lower_case=False,
do_train=True, eval_all_checkpoints=False,
eval_data_file='/idiap/temp/rkarimi/resources/wikitext-2-raw/wiki.test.raw',
evaluate_during_training=True, fp16=False, fp16_opt_level='O1',
gradient_accumulation_steps=1, learning_rate=5e-05, local_rank=-1,
logging_steps=50, max_grad_norm=1.0, max_steps=-1, mlm=True,
mlm_probability=0.15,
model_name_or_path='/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/',
model_type='bert', n_gpu=1, no_cuda=False, num_train_epochs=3.0,
output_dir='/idiap/temp/rkarimi/lm_heads/bert_original',
overwrite_cache=False, overwrite_output_dir=True,
per_gpu_eval_batch_size=4, per_gpu_train_batch_size=4, save_steps=50,
save_total_limit=None, seed=42, server_ip='', server_port='',
tokenizer_name='',
train_data_file='/idiap/temp/rkarimi/resources/wikitext-2-raw/wiki.train.raw',
warmup_steps=0, weight_decay=0.0)
01/09/2020 09:38:09 - INFO - __main__ - Loading features from cached file
/idiap/temp/rkarimi/pretrained_transformers/bert-base-uncased/_cached_lm_510_wiki.train.raw
01/09/2020 09:38:09 - INFO - __main__ - Running training
01/09/2020 09:38:09 - INFO - __main__ - Num examples = 4312
01/09/2020 09:38:09 - INFO - __main__ - Num Epochs = 3
01/09/2020 09:38:09 - INFO - __main__ - Instantaneous batch size per
GPU = 4
01/09/2020 09:38:09 - INFO - __main__ - Total train batch size (w.
parallel, distributed & accumulation) = 4
01/09/2020 09:38:09 - INFO - __main__ - Gradient Accumulation steps = 1
01/09/2020 09:38:09 - INFO - __main__ - Total optimization steps = 3234
01/09/2020 09:38:09 - INFO - __main__ - Starting fine-tuning.
Epoch: 0%|
| 0/3 [00:00<?,
?it/s/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [2,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [3,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [4,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [5,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [7,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [8,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [9,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [10,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [11,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [12,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [13,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [14,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [15,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [16,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [17,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [19,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [20,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [21,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [22,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [24,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [25,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [26,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [27,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [28,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [29,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [30,0,0] Assertion t >= 0 && t <
n_classes failed.
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/ClassNLLCriterion.cu:105:
void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *,
long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype =
float]: block: [0,0,0], thread: [31,0,0] Assertion t >= 0 && t <
n_classes failed.
Traceback (most recent call last):
File "run_lm_original.py", line 717, in <module>
main()
File "run_lm_original.py", line 667, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_lm_original.py", line 316, in train
loss.backward()
File
"/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/torch/tensor.py",
line 118, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File
"/idiap/user/rkarimi/libs/anaconda3/envs/transformer/lib/python3.6/site-packages/torch/autograd/__init__.py",
line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: merge_sort: failed to synchronize: device-side assert
triggered
Epoch: 0%|
| 0/3 [00:00<?, ?it/s]
Iteration: 0%|
Best
Rabeeh
@rabeehk, concerning your first issue:
> block_size value is by default = -1, which creates the following error, can be solved by setting the default value to 512
The very first use of args.block_size is to check whether it is negative (e.g. -1) and, if so, to set it to the maximum model length. Is this not working in your case?
> The issue will resolve by setting smaller block_size <= 510, it would be very nice to document this in the codes that one needs to set the block_size <= 510 as a temporary solution. thanks
This should be solved by the previously mentioned lines as well.
> In mask_tokens function, the following lines needs to be set to -1 not -100 which is the ignore_index used in the "BertForMaskedLM" model:
> labels[~masked_indices] = -100 => -1
This is not the case anymore, as you can see in the BertForMaskedLM source code. The examples are maintained to work with the current master branch, not with a specific release. If you want to run the scripts with a specific version, you can get them from that version's tag on GitHub, e.g. version 2.3.0.
Please let me know if you can see why the block size doesn't seem to be set to the maximum value. I'll fix it if it is an issue with the script. Thank you @rabeehk!
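To illustrate the ignore_index point, here is a pure-Python stand-in for the target check behind the "Assertion t >= 0 && t < n_classes failed" messages. It is only a sketch, not PyTorch's implementation, and the function name `failing_targets` is made up for this example. It assumes the mismatch is between a script that writes -100 into the labels and a model whose loss ignores a different value:

```python
def failing_targets(labels, n_classes, ignore_index):
    """Return the labels that would violate `t >= 0 && t < n_classes`.

    Labels equal to ignore_index are skipped, as a loss with an
    ignore_index would skip them. Illustrative helper only.
    """
    return [t for t in labels if t != ignore_index and not (0 <= t < n_classes)]


vocab_size = 30522                  # from the config dump above
labels = [-100, 2023, -100, 1037]   # -100 marks positions that were not masked

print(failing_targets(labels, vocab_size, ignore_index=-100))  # []
print(failing_targets(labels, vocab_size, ignore_index=-1))    # [-100, -100]
```

If the installed library release and the script disagree on the ignored value, the unmasked positions reach the loss as out-of-range targets, which may be what triggers the device-side assert in the log above.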
@rabeehk Hi! May I ask whether you fixed the problem "ValueError: num_samples should be a positive integeral value, but got num_samples=0" by setting "global_step = 0"? Like this:
`try:
    # set global_step to global_step of last saved checkpoint from model path
    checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
    global_step = int(checkpoint_suffix)
    epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
    steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)`
Should I change "global_step = int(checkpoint_suffix)" to "global_step = 0"? Thanks!
Hi
No. You need to set block_size to a positive number; try 510, maybe.
Best
Rabeeh
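One way num_samples=0 can arise, sketched under the assumption that the dataset cuts the tokenized text into full, non-overlapping blocks of block_size tokens: if block_size ends up larger than the whole corpus, no block fits and the dataset is empty. The function name `count_blocks` and the value 10**12 (a stand-in for a very large "maximum length" reported for a local tokenizer) are assumptions made for illustration:

```python
def count_blocks(num_tokens: int, block_size: int) -> int:
    """Number of full, non-overlapping blocks of block_size tokens.

    Illustrative helper; it mirrors a simple fixed-size chunking loop.
    """
    if block_size <= 0:
        raise ValueError("block_size must be a positive number")
    # One start offset per block that fits entirely inside the text.
    return len(range(0, num_tokens - block_size + 1, block_size))


print(count_blocks(1020, 510))     # 2
print(count_blocks(1000, 510))     # 1
print(count_blocks(1000, 10**12))  # 0 -> an empty dataset, so the sampler fails
```

Passing an explicit --block_size such as 510 keeps the block size below the corpus length, which is consistent with the fix reported in this thread.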
Changing from 512 to 510 worked for me. I would have thought we should be able to use 512, the maximum input size for BERT. Or is there something I'm overlooking?
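A possible explanation, offered as an assumption rather than a confirmed diagnosis: BERT wraps a single sequence in two special tokens, [CLS] and [SEP], so if the script version in use adds them to each block without first subtracting them from block_size, a block of 512 tokens becomes 514 positions and overflows the position embedding table. The arithmetic, with hypothetical constant and function names:

```python
MAX_POSITION_EMBEDDINGS = 512  # "max_position_embeddings" in the config dumps above
NUM_SPECIAL_TOKENS = 2         # [CLS] and [SEP] around a single BERT sequence


def fits(block_size: int) -> bool:
    """True if a block plus its special tokens fits in the position table."""
    return block_size + NUM_SPECIAL_TOKENS <= MAX_POSITION_EMBEDDINGS


print(fits(510))  # True
print(fits(511))  # False
print(fits(512))  # False
print(MAX_POSITION_EMBEDDINGS - NUM_SPECIAL_TOKENS)  # 510
```

Under that assumption, 510 is the largest block_size that still leaves room for the two special tokens.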
Hi, I just encountered the same error finetuning a custom gpt-2 model with run_language_modeling.py on Colab.
Traceback (most recent call last):
File "run_language_modeling.py", line 799, in <module>
main()
File "run_language_modeling.py", line 749, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_language_modeling.py", line 245, in train
train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/sampler.py", line 94, in __init__
"value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0
I solved it by specifying --block_size, as @rabeehk said.
It might be worth mentioning that in your docs, or having a default setup that works out of the box. I also had to dig into the code to find the --should_continue and --overwrite_output_dir flags to continue training. Is there a page where that is discussed, by any chance?
As an aside, I can't seem to find a flag to print the loss during training? I see there is a log/save step every 500 iterations, but it doesn't give any of these stats. Is there something super obvious I am missing?
Thanks in any case!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.