Transformers: run_squad questions

Created on 5 Nov 2018 · 15 comments · Source: huggingface/transformers

Thanks a lot for the port! I have some minor questions about the run_squad file. I see two options for accumulating gradients, accumulate_gradients and gradient_accumulation_steps, but it seems to me that they could be combined into one. The other question is about the global_step variable: it seems we are only counting it but never using it in the gradient accumulation. Thanks again!
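For reference, here is a minimal sketch (not the actual run_squad.py code) of how a single gradient_accumulation_steps argument is typically wired up, with global_step counting optimizer updates rather than mini-batches; the model, optimizer and dataloader names are placeholders:

```python
# Minimal sketch of gradient accumulation with one argument: the loss is
# averaged over the accumulation steps and the optimizer only steps every
# `gradient_accumulation_steps` mini-batches; `global_step` then counts
# optimizer updates, not mini-batches.
import torch

def train(model, optimizer, dataloader, num_epochs=1, gradient_accumulation_steps=2):
    global_step = 0
    model.train()
    for _ in range(num_epochs):
        for step, batch in enumerate(dataloader):
            loss = model(*batch)                       # forward pass returning the loss
            loss = loss / gradient_accumulation_steps  # average over the accumulated steps
            loss.backward()                            # gradients accumulate in .grad
            if (step + 1) % gradient_accumulation_steps == 0:
                optimizer.step()
                optimizer.zero_grad()
                global_step += 1                       # one parameter update = one global step
    return global_step
```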

All 15 comments

It also seems to me that the SQuAD 1.1 example cannot reproduce the Google TensorFlow version's performance.

> It also seems to me that the SQuAD 1.1 example cannot reproduce the Google TensorFlow version's performance.

What batch size are you running?

I'm running on 4 GPUs with a batch size of 48, and the result is {"exact_match": 21.551561021759696, "f1": 41.785968963154055}.

Just ran on 1 GPU with a batch size of 10; the result is {"exact_match": 21.778618732261116, "f1": 41.83593185416649}.
Actually, it might be an issue with the eval code. I'll look into it.

Sure, thanks. I'm checking for the reason too and will report if I find anything.

The predictions file is only outputting one word. We need to find out whether the bug is in the model itself or in the write_predictions function in run_squad.py. The correct answer always seems to be in nbest_predictions, but it's never selected.
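As an illustrative sketch (not the repo's write_predictions code), the n-best list for SQuAD 1.1 is usually reduced to a final answer by ranking candidate spans on the sum of their start and end logits and taking the best non-empty span; a bug anywhere in this post-processing can leave the right answer in nbest_predictions while a wrong, e.g. single-word, span gets selected. The function name and dict keys below are assumptions for illustration.

```python
# Hypothetical helper showing the usual n-best selection logic for SQuAD 1.1.
from typing import List, Dict

def pick_best_prediction(nbest: List[Dict]) -> str:
    """Each entry is assumed to look like {"text": str, "start_logit": float, "end_logit": float}."""
    ranked = sorted(nbest, key=lambda p: p["start_logit"] + p["end_logit"], reverse=True)
    for pred in ranked:
        if pred["text"].strip():  # skip empty or degenerate spans
            return pred["text"]
    return ""
```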

What performance does Hugging Face get on SQuAD using this reimplementation?

Hi all,
We were not able to try SQuAD on a multi-GPU setup with the correct batch size until recently, so we relied on the standard deviations computed in the notebooks to compare the predicted hidden states and losses for the SQuAD script. I was able to try on a multi-GPU setup today and there is indeed a strong difference.
We got about the same results as you: an F1 of 41.8 and an exact match of 21.7.
I am investigating right now. My personal guess is that this may be related to things outside the model itself, like the optimizer or the post-processing in SQuAD, as these were not compared between the TF and PT models.
I will keep you guys updated in this issue, and I have added a mention in the readme that the SQuAD example doesn't work yet.
If you have some insights, feel free to participate in the discussion.

If you're comparing activations, it may be worth comparing gradients as well, to see if you get similarly close gradient standard deviations for identical batches. You might find that the gradients already differ at the last layer (due to e.g. differences in how PyTorch handles weight decay or optimization); or you might find that the gradients only stop being comparable after a particular point in backpropagation, which would suggest that the backward pass of a particular function differs between PyTorch and TensorFlow.
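A small helper like the following (illustrative, not from the repo) is one way to dump per-parameter gradient statistics in PyTorch after a backward pass on an identical batch, so they can be lined up layer by layer against the TensorFlow run:

```python
# Collect per-parameter gradient means and standard deviations so the point
# where the two implementations diverge can be located layer by layer.
import torch

def report_grad_stats(model):
    stats = {}
    for name, param in model.named_parameters():
        if param.grad is not None:
            stats[name] = (param.grad.mean().item(), param.grad.std().item())
    return stats

# Typical usage (assuming `model`, `batch` and a loss-returning forward pass):
#   model.zero_grad()
#   loss = model(*batch)
#   loss.backward()
#   for name, (mean, std) in report_grad_stats(model).items():
#       print(f"{name}: grad mean={mean:.3e}, std={std:.3e}")
```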

Ok guys, thanks for waiting. We've nailed down the culprit, which was in fact a bug in the pre-processing logic (more exactly, this dumb typo: https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/run_squad.py#L865).

I took the occasion to clean up a few things I noticed while walking through the code:

  • the weight initialization was not optimal: tf.truncated_normal_initializer(stddev=0.02) was translated as weight.data.normal_(0.02) instead of weight.data.normal_(mean=0.0, std=0.02), which likely affected the performance of run_classifier.py as well.
  • the gradient accumulation loss was not averaged over the accumulation steps, which would have required changing the hyper-parameters when using accumulation.
  • the evaluation was not done under torch.no_grad() and was thus sub-optimal in terms of speed and memory (a rough sketch of these fixes is shown after this list).
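A minimal sketch of the initialization and evaluation fixes, with illustrative function names (not the exact run_squad.py code); the accumulation fix is the loss averaging shown in the training-loop sketch earlier in this thread:

```python
# Illustrative sketch of two of the fixes described above.
import torch

# 1. Weight initialization: match tf.truncated_normal_initializer(stddev=0.02)
#    with an explicit mean and std, not normal_(0.02), which sets mean=0.02.
def init_bert_weights(module, initializer_range=0.02):
    if isinstance(module, (torch.nn.Linear, torch.nn.Embedding)):
        module.weight.data.normal_(mean=0.0, std=initializer_range)

# 2. Evaluation: run the forward pass under torch.no_grad() so no activations
#    are kept for backprop, saving memory and time.
def evaluate(model, eval_dataloader):
    model.eval()
    outputs = []
    with torch.no_grad():
        for batch in eval_dataloader:
            outputs.append(model(*batch))
    return outputs
```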

These fixes are pushed on the develop branch right now.

All in all, I think we are pretty good now, and none of these issues affected the core PyTorch model (the BERT Transformer itself), so if you only used extract_features.py you were fine from the beginning. And run_classifier.py was ok apart from the sub-optimal initialization of the additional weights.

I will merge the develop branch as soon as we get the final results confirmed. Currently it's been training for 20 minutes (0.3 epoch) on 4 GPUs with a batch size of 56, and we are already above 85 F1 on SQuAD and 77 exact match, so I'm rather confident, and I think you guys can play with it too now.

I am also cleaning up the code base to prepare for a first release that we will put on pip for easier access.

@thomwolf This is awesome - thank you! Do you know what the final SQuAD results were from the training run you started?

I got {"exact_match": 80.07568590350047, "f1": 87.6494485519583} with slightly sub-optimal parameters (a max_seq_length of 300 instead of 384, which means more answers are truncated, and a batch size of 56 for 2 epochs of training, which is probably too large a batch size, and/or 1 epoch should suffice).

It trains in about 1h/epoch on 4 GPUs with such a big batch size and truncated examples.

Using the same hyper-parameters as the TensorFlow version, we are actually slightly better on F1 than the original implementation (on the default random seed we used):
{"f1": 88.52381567990474, "exact_match": 81.22043519394512}
versus TF: {"f1": 88.41249612335034, "exact_match": 81.2488174077578}

I am trying BERT-large on SQuAD now, which is totally doable on a 4-GPU server with the recommended batch size of 24 (about 16h of expected training time using the --optimize_on_cpu option and 2 steps of gradient accumulation). I will update the readme with the results.

Great, I saw the BERT-large ones as well - thank you for sharing these results! How long did the BERT-base SQuAD training take on a single GPU when you tried it? I saw BERT-large took ~18 hours over 4 K80s.

Hi Ethan, I didn't try SQuAD on a single GPU. On four K80s (not K40s), BERT-base took 5h to train on SQuAD.
