The example script for SQuAD question answering (examples/question-answering/run_squad.py) does not reproduce the results claimed in the tutorial.
The expected performance on SQuAD v1.1 is around f1 = 88.52 and exact_match = 81.22, but the script produces f1 = 81.97 and exact_match = 73.80 instead.
Steps to reproduce the behavior:
Run examples/question-answering/run_squad.py with the exact same arguments as in the tutorial:
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--per_gpu_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_squad/
Here is the final result:
05/24/2020 16:10:09 - INFO - __main__ - Running evaluation
05/24/2020 16:10:09 - INFO - __main__ - Num examples = 10789
05/24/2020 16:10:09 - INFO - __main__ - Batch size = 8
Evaluating: 100%|██████████| 1349/1349 [01:31<00:00, 14.81it/s]
05/24/2020 16:11:41 - INFO - __main__ - Evaluation done in total 91.079697 secs (0.008442 sec per example)
05/24/2020 16:11:41 - INFO - transformers.data.metrics.squad_metrics - Writing predictions to: out-noamp/predictions_.json
05/24/2020 16:11:41 - INFO - transformers.data.metrics.squad_metrics - Writing nbest to: out-noamp/nbest_predictions_.json
05/24/2020 16:12:09 - INFO - __main__ - Results: {'exact': 73.80321665089878, 'f1': 81.96651715123286, 'total': 10570, 'HasAns_exact': 73.80321665089878, 'HasAns_f1': 81.96651715123286, 'HasAns_total': 10570, 'best_exact': 73.80321665089878, 'best_exact_thresh': 0.0, 'best_f1': 81.96651715123286, 'best_f1_thresh': 0.0}
The script should produce f1 = 88.52, exact_match = 81.22.
transformers version: 2.10.0
My results are different as well:
"exact_match": 71.92999053926206
"f1": 80.70949484221217
My guess is that this occurs because we are not using a fixed seed. The runs are not deterministic, so differences _will_ occur.
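One way to test that would be to re-run the same command with a few different seeds and compare the spread of the scores. A rough sketch, assuming the script's --seed argument (default 42, if I remember correctly); the seed values and output directories below are arbitrary, purely for illustration:
# hypothetical sweep: one run per seed, results land in separate output dirs
for SEED in 13 42 1234; do
  python run_squad.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --do_train \
    --do_eval \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --per_gpu_train_batch_size 12 \
    --learning_rate 3e-5 \
    --num_train_epochs 2.0 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --seed $SEED \
    --output_dir /tmp/debug_squad_seed_${SEED}/
done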
Possibly, but the difference of 7~8 points in f1 and EM scores is way above the usual variance due to random seeds.
Found the bug. --do_lower_case was missing in the script arguments.
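For reference, this is just the tutorial command from above with the flag added back:
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/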
Now the results are pretty close to the ones mentioned in the tutorial.
05/24/2020 23:50:04 - INFO - __main__ - Results: {'exact': 80.26490066225166, 'f1': 88.01726518927101, 'total': 10570, 'HasAns_exact': 80.26490066225166, 'HasAns_f1': 88.01726518927101, 'HasAns_total': 10570, 'best_exact': 80.26490066225166, 'best_exact_thresh': 0.0, 'best_f1': 88.01726518927101, 'best_f1_thresh': 0.0}
> Possibly, but the difference of 7~8 points in f1 and EM scores is way above the usual variance due to random seeds.
Unfortunately not. Have a look at these experiments by my friends over at NLP Town. They ran sentiment analysis experiments ten times (each time with a different seed). https://www.linkedin.com/posts/nlp-town_sentimentanalysis-camembert-xlm-activity-6605379961111007232-KJy3
That being said, I do think you are right, good catch!
Closing this b/c #4245 was merged
(we still need to investigate why the lowercasing is not properly populated by the model's config)
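In the meantime, a quick sanity check of what the pretrained tokenizer does out of the box. This is a rough sketch: it assumes BertTokenizer exposes basic_tokenizer.do_lower_case and that no explicit do_lower_case override is passed (the script currently passes one from its arguments):
# should print True and a lowercased tokenization if lowercasing is picked up from the pretrained config
python -c "from transformers import BertTokenizer; \
tok = BertTokenizer.from_pretrained('bert-base-uncased'); \
print(tok.basic_tokenizer.do_lower_case); \
print(tok.tokenize('Hello World'))"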