The example script for SQuAD question answering (examples/question-answering/run_squad.py) does not reproduce the results claimed in the tutorial.
The expected performance on SQuAD v1.1 is around f1 = 88.52 and exact_match = 81.22, but the script produces f1 = 81.97 and exact_match = 73.80 instead.
Steps to reproduce the behavior:
Run examples/question-answering/run_squad.py with the exact same arguments as in the tutorial:
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--per_gpu_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_squad/
Here is the final result:
05/24/2020 16:10:09 - INFO - __main__ - Running evaluation
05/24/2020 16:10:09 - INFO - __main__ - Num examples = 10789
05/24/2020 16:10:09 - INFO - __main__ - Batch size = 8
Evaluating: 100%|██████████| 1349/1349 [01:31<00:00, 14.81it/s]
05/24/2020 16:11:41 - INFO - __main__ - Evaluation done in total 91.079697 secs (0.008442 sec per example)
05/24/2020 16:11:41 - INFO - transformers.data.metrics.squad_metrics - Writing predictions to: out-noamp/predictions_.json
05/24/2020 16:11:41 - INFO - transformers.data.metrics.squad_metrics - Writing nbest to: out-noamp/nbest_predictions_.json
05/24/2020 16:12:09 - INFO - __main__ - Results: {'exact': 73.80321665089878, 'f1': 81.96651715123286, 'total': 10570, 'HasAns_exact': 73.80321665089878, 'HasAns_f1': 81.96651715123286, 'HasAns_total': 10570, 'best_exact': 73.80321665089878, 'best_exact_thresh': 0.0, 'best_f1': 81.96651715123286, 'best_f1_thresh': 0.0}
The script should produce f1 = 88.52, exact_match = 81.22.
transformers version: 2.10.0
My results are different as well:
"exact_match": 71.92999053926206
"f1": 80.70949484221217
My guess is that this occurs because we are not using a fixed seed. The runs are not deterministic, so differences _will_ occur.
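One way to test that would be to re-run the same command with a few different seeds and compare the spread of the scores. A rough sketch, assuming the script's --seed argument (default 42, if I remember correctly); the seed values and output directories below are arbitrary, purely for illustration:
# hypothetical sweep: one run per seed, results land in separate output dirs
for SEED in 13 42 1234; do
  python run_squad.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --do_train \
    --do_eval \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --per_gpu_train_batch_size 12 \
    --learning_rate 3e-5 \
    --num_train_epochs 2.0 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --seed $SEED \
    --output_dir /tmp/debug_squad_seed_${SEED}/
done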
Possibly, but the difference of 7~8 points in f1 and EM scores is way above the usual variance due to random seeds.
Found the bug. --do_lower_case was missing in the script arguments.
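For reference, this is just the tutorial command from above with the flag added back:
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/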
Now the results are pretty close to the ones mentioned in the tutorial.
05/24/2020 23:50:04 - INFO - __main__ - Results: {'exact': 80.26490066225166, 'f1': 88.01726518927101, 'total': 10570, 'HasAns_exact': 80.26490066225166, 'HasAns_f1': 88.01726518927101, 'HasAns_total': 10570, 'best_exact': 80.26490066225166, 'best_exact_thresh': 0.0, 'best_f1': 88.01726518927101, 'best_f1_thresh': 0.0}
> Possibly, but the difference of 7~8 points in f1 and EM scores is way above the usual variance due to random seeds.
Unfortunately not. Have a look at these experiments by my friends over at NLP Town. They ran sentiment analysis experiments ten times (each time with a different seed). https://www.linkedin.com/posts/nlp-town_sentimentanalysis-camembert-xlm-activity-6605379961111007232-KJy3
That being said, I do think you are right, good catch!
Closing this b/c #4245 was merged
(we still need to investigate why the lowercasing is not properly populated by the model's config)
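In the meantime, a quick sanity check of what the pretrained tokenizer does out of the box. This is a rough sketch: it assumes BertTokenizer exposes basic_tokenizer.do_lower_case and that no explicit do_lower_case override is passed (the script currently passes one from its arguments):
# should print True and a lowercased tokenization if lowercasing is picked up from the pretrained config
python -c "from transformers import BertTokenizer; \
tok = BertTokenizer.from_pretrained('bert-base-uncased'); \
print(tok.basic_tokenizer.do_lower_case); \
print(tok.tokenize('Hello World'))"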