Traceback (most recent call last): | 1/87970 [00:00<8:35:35, 2.84it/s]
File "./run_squad.py", line 990, in
main()
File "./run_squad.py", line 922, in main
is_nan = set_optimizer_params_grad(param_optimizer, model.named_parameters(), test_nan=True)
File "./run_squad.py", line 691, in set_optimizer_params_grad
if test_nan and torch.isnan(param_model.grad).sum() > 0:
File "/people/sanjay/anaconda2/envs/bert_pytorch/lib/python3.5/site-packages/torch/functional.py", line 289, in isnan
raise ValueError("The argument is not a tensor", str(tensor))
ValueError: ('The argument is not a tensor', 'None')
Command:
CUDA_VISIBLE_DEVICES=0 python ./run_squad.py \
--vocab_file bert_large/uncased_L-24_H-1024_A-16/vocab.txt \
--bert_config_file bert_large/uncased_L-24_H-1024_A-16/bert_config.json \
--init_checkpoint bert_large/uncased_L-24_H-1024_A-16/pytorch_model.bin \
--do_lower_case \
--do_train \
--do_predict \
--train_file squad_dir/train-v1.1.json \
--predict_file squad_dir/dev-v1.1.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir outputs \
--train_batch_size 4 \
--gradient_accumulation_steps 2 \
--optimize_on_cpu
The error occurs only when using --optimize_on_cpu; it works fine without that argument.
GPU: Nvidia GTX 1080Ti (single GPU).
PS: I can only fit train_batch_size 4 in the memory of a single GPU.
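For context, the crash comes from calling torch.isnan on a parameter whose .grad is still None (parameters that never received a gradient during the backward pass). Below is a minimal sketch of the kind of guard that avoids it, loosely modeled on run_squad.py's set_optimizer_params_grad; the body is illustrative and not necessarily the exact fix that was pushed:

```python
import torch

def set_optimizer_params_grad(named_params_optimizer, named_params_model, test_nan=False):
    """Copy gradients from the GPU model parameters to the CPU copy kept by the optimizer."""
    is_nan = False
    for (name_opti, param_opti), (name_model, param_model) in zip(named_params_optimizer,
                                                                   named_params_model):
        if name_opti != name_model:
            raise ValueError("Parameter mismatch: {} vs {}".format(name_opti, name_model))
        if param_model.grad is None:
            # Parameters that never received a gradient have .grad == None;
            # calling torch.isnan on None is what raised the ValueError in the traceback above.
            param_opti.grad = None
            continue
        if test_nan and torch.isnan(param_model.grad).sum() > 0:
            is_nan = True
        if param_opti.grad is None:
            param_opti.grad = param_opti.data.new_zeros(param_opti.data.size())
        param_opti.grad.data.copy_(param_model.grad.data)
    return is_nan
```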
Thanks! I pushed a fix for that, you can try it again. You should be able to increase the batch size a bit.
By the way, the real batch size that is used on the GPU is train_batch_size / gradient_accumulation_steps, so 2 in your case. I think you should be able to go to 3 with --optimize_on_cpu (see the accumulation sketch after this comment).
The recommended batch_size to get good results (EM, F1) with BERT large on SQuAD is 24. You can try the following possibilities to get to this batch_size:
--train_batch_size 24 --gradient_accumulation_steps 12
--train_batch_size 24 --gradient_accumulation_steps 8 --optimize_on_cpu
--train_batch_size 24 --gradient_accumulation_steps 6 or 4 --fp16
If your GPU supports fp16, the last solution should be the fastest; otherwise the second should be the fastest. The first solution should work out of the box and give better results (EM, F1), but you won't get any speed-up.
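To make the batch-size arithmetic concrete, here is a standalone gradient-accumulation sketch (the dummy model, optimizer, and data are placeholders, not run_squad.py's actual variables): with train_batch_size 24 and gradient_accumulation_steps 8, each forward/backward pass sees 24 / 8 = 3 examples and the optimizer steps once every 8 passes, so each update effectively uses 24 examples.

```python
import torch

# Hypothetical illustration: effective batch per optimizer update
# = per-GPU micro-batch * gradient_accumulation_steps.
train_batch_size = 24                 # effective batch size per optimizer update
gradient_accumulation_steps = 8
per_gpu_batch_size = train_batch_size // gradient_accumulation_steps   # 3 examples on the GPU

model = torch.nn.Linear(10, 1)        # stand-in for BERT
optimizer = torch.optim.SGD(model.parameters(), lr=3e-5)

optimizer.zero_grad()
for step in range(32):                                     # fake training steps
    inputs = torch.randn(per_gpu_batch_size, 10)           # 3-example micro-batch
    loss = model(inputs).pow(2).mean()
    (loss / gradient_accumulation_steps).backward()        # scale so accumulated grads average correctly
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()                                    # one update for every 24 examples seen
        optimizer.zero_grad()
```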
Should be fixed now. Don't hesitate to re-open an issue if needed. Thanks for the feedback!
Yes, it works now!
With --train_batch_size 24 --gradient_accumulation_steps 8 --optimize_on_cpu I get {"exact_match": 83.78429517502366, "f1": 90.75733469379139}, which is pretty close.
Thanks for this amazing work!