Traceback (most recent call last): | 1/87970 [00:00<8:35:35, 2.84it/s]
File "./run_squad.py", line 990, in
main()
File "./run_squad.py", line 922, in main
is_nan = set_optimizer_params_grad(param_optimizer, model.named_parameters(), test_nan=True)
File "./run_squad.py", line 691, in set_optimizer_params_grad
if test_nan and torch.isnan(param_model.grad).sum() > 0:
File "/people/sanjay/anaconda2/envs/bert_pytorch/lib/python3.5/site-packages/torch/functional.py", line 289, in isnan
raise ValueError("The argument is not a tensor", str(tensor))
ValueError: ('The argument is not a tensor', 'None')
Command:
CUDA_VISIBLE_DEVICES=0 python ./run_squad.py \
--vocab_file bert_large/uncased_L-24_H-1024_A-16/vocab.txt \
--bert_config_file bert_large/uncased_L-24_H-1024_A-16/bert_config.json \
--init_checkpoint bert_large/uncased_L-24_H-1024_A-16/pytorch_model.bin \
--do_lower_case \
--do_train \
--do_predict \
--train_file squad_dir/train-v1.1.json \
--predict_file squad_dir/dev-v1.1.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir outputs \
--train_batch_size 4 \
--gradient_accumulation_steps 2 \
--optimize_on_cpu
The error occurs only when using --optimize_on_cpu; it works fine without that argument.
GPU: Nvidia GTX 1080Ti (single GPU).
PS: I can only fit train_batch_size 4 in the memory of a single GPU.
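For context, the crash comes from calling torch.isnan on a parameter whose .grad is still None (parameters that never received a gradient during the backward pass). Below is a minimal sketch of the kind of guard that avoids it, loosely modeled on run_squad.py's set_optimizer_params_grad; the body is illustrative and not necessarily the exact fix that was pushed:

```python
import torch

def set_optimizer_params_grad(named_params_optimizer, named_params_model, test_nan=False):
    """Copy gradients from the GPU model parameters to the CPU copy kept by the optimizer."""
    is_nan = False
    for (name_opti, param_opti), (name_model, param_model) in zip(named_params_optimizer,
                                                                   named_params_model):
        if name_opti != name_model:
            raise ValueError("Parameter mismatch: {} vs {}".format(name_opti, name_model))
        if param_model.grad is None:
            # Parameters that never received a gradient have .grad == None;
            # calling torch.isnan on None is what raised the ValueError in the traceback above.
            param_opti.grad = None
            continue
        if test_nan and torch.isnan(param_model.grad).sum() > 0:
            is_nan = True
        if param_opti.grad is None:
            param_opti.grad = param_opti.data.new_zeros(param_opti.data.size())
        param_opti.grad.data.copy_(param_model.grad.data)
    return is_nan
```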
Thanks! I pushed a fix for that, you can try it again. You should be able to increase the batch size a bit.
By the way, the real batch size that is used on the GPU is train_batch_size / gradient_accumulation_steps, so 2 in your case. I think you should be able to go to 3 with --optimize_on_cpu (see the accumulation sketch after this comment).
The recommended batch_size to get good results (EM, F1) with BERT large on SQuAD is 24. You can try the following possibilities to get to this batch_size:
--train_batch_size 24 --gradient_accumulation_steps 12
--train_batch_size 24 --gradient_accumulation_steps 8 --optimize_on_cpu
--train_batch_size 24 --gradient_accumulation_steps 6 or 4 --fp16
If your GPU supports fp16, the last solution should be the fastest; otherwise the second should be the fastest. The first solution should work out of the box and give better results (EM, F1), but you won't get any speed-up.
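To make the batch-size arithmetic concrete, here is a standalone gradient-accumulation sketch (the dummy model, optimizer, and data are placeholders, not run_squad.py's actual variables): with train_batch_size 24 and gradient_accumulation_steps 8, each forward/backward pass sees 24 / 8 = 3 examples and the optimizer steps once every 8 passes, so each update effectively uses 24 examples.

```python
import torch

# Hypothetical illustration: effective batch per optimizer update
# = per-GPU micro-batch * gradient_accumulation_steps.
train_batch_size = 24                 # effective batch size per optimizer update
gradient_accumulation_steps = 8
per_gpu_batch_size = train_batch_size // gradient_accumulation_steps   # 3 examples on the GPU

model = torch.nn.Linear(10, 1)        # stand-in for BERT
optimizer = torch.optim.SGD(model.parameters(), lr=3e-5)

optimizer.zero_grad()
for step in range(32):                                     # fake training steps
    inputs = torch.randn(per_gpu_batch_size, 10)           # 3-example micro-batch
    loss = model(inputs).pow(2).mean()
    (loss / gradient_accumulation_steps).backward()        # scale so accumulated grads average correctly
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()                                    # one update for every 24 examples seen
        optimizer.zero_grad()
```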
Should be fixed now. Don't hesitate to re-open an issue if needed. Thanks for the feedback!
Yes, it works now!
With --train_batch_size 24 --gradient_accumulation_steps 8 --optimize_on_cpu I get {"exact_match": 83.78429517502366, "f1": 90.75733469379139}, which is pretty close.
Thanks for this amazing work!