Transformers: Example script for SQuAD question answering unable to reproduce the claimed performance

Created on 24 May 2020 · 5 comments · Source: huggingface/transformers

🐛 Bug

Information

The example script for SQuAD question answering (examples/question-answering/run_squad.py) fails to reproduce the results claimed in the tutorial.
The expected performance on SQuAD v1.1 is around f1 = 88.52 and exact_match = 81.22, but the script produces f1 = 81.97 and exact_match = 73.80 instead.

To reproduce

Steps to reproduce the behavior:

  1. Install transformers at the latest commit (a34a989).
  2. Download the SQuAD v1.1 dataset.
  3. Run examples/question-answering/run_squad.py with the exact arguments given in the tutorial:
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --do_eval \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/

The final result is the following:

05/24/2020 16:10:09 - INFO - __main__ - Running evaluation
05/24/2020 16:10:09 - INFO - __main__ - Num examples = 10789
05/24/2020 16:10:09 - INFO - __main__ - Batch size = 8
Evaluating: 100%|██████████| 1349/1349 [01:31<00:00, 14.81it/s]
05/24/2020 16:11:41 - INFO - __main__ - Evaluation done in total 91.079697 secs (0.008442 sec per example)
05/24/2020 16:11:41 - INFO - transformers.data.metrics.squad_metrics - Writing predictions to: out-noamp/predictions_.json
05/24/2020 16:11:41 - INFO - transformers.data.metrics.squad_metrics - Writing nbest to: out-noamp/nbest_predictions_.json
05/24/2020 16:12:09 - INFO - __main__ - Results: {'exact': 73.80321665089878, 'f1': 81.96651715123286, 'total': 10570, 'HasAns_exact': 73.80321665089878, 'HasAns_f1': 81.96651715123286, 'HasAns_total': 10570, 'best_exact': 73.80321665089878, 'best_exact_thresh': 0.0, 'best_f1': 81.96651715123286, 'best_f1_thresh': 0.0}

Expected behavior

The script should produce f1 = 88.52, exact_match = 81.22.

Environment info

  • transformers version: 2.10.0
  • Platform: Linux-4.15.0-99-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.7
  • PyTorch version (GPU?): 1.5.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: True
  • Using distributed or parallel set-up in script?: False

Most helpful comment

Found the bug. --do_lower_case was missing in the script arguments.

Now the results are pretty close to the ones mentioned in the tutorial.

05/24/2020 23:50:04 - INFO - __main__ - Results: {'exact': 80.26490066225166, 'f1': 88.01726518927101, 'total': 10570, 'HasAns_exact': 80.26490066225166, 'HasAns_f1': 88.01726518927101, 'HasAns_total': 10570, 'best_exact': 80.26490066225166, 'best_exact_thresh': 0.0, 'best_f1': 88.01726518927101, 'best_f1_thresh': 0.0}
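
For reference, this is the reproduce command from above with only the missing flag added:

export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/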

All 5 comments

My results are different as well:

"exact_match": 71.92999053926206
"f1": 80.70949484221217

My guess is that this occurs because we are not using a fixed seed. The runs are not deterministic, so differences _will_ occur.
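
(For illustration only: a minimal sketch of what pinning the seeds manually looks like in PyTorch. Whether and how run_squad.py seeds its own RNGs is a separate question, so this is not meant as the fix.)

import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    # Pin the Python, NumPy and PyTorch RNGs so repeated runs start from the
    # same initialization; GPU kernels can still introduce nondeterminism.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)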

Possibly, but a difference of 7-8 points in f1 and EM scores is way above the usual variance due to random seeds.

Found the bug. --do_lower_case was missing in the script arguments.

Now the results are pretty close to the ones mentioned in the tutorial.

05/24/2020 23:50:04 - INFO - __main__ - Results: {'exact': 80.26490066225166, 'f1': 88.01726518927101, 'total': 10570, 'HasAns_exact': 80.26490066225166, 'HasAns_f1': 88.01726518927101, 'HasAns_total': 10570, 'best_exact': 80.26490066225166, 'best_exact_thresh': 0.0, 'best_f1': 88.01726518927101, 'best_f1_thresh': 0.0}

> Possibly, but a difference of 7-8 points in f1 and EM scores is way above the usual variance due to random seeds.

Unfortunately not. Have a look at these experiments by my friends over at NLP Town: they did sentiment analysis and ran the experiments ten times (each time with a different seed). https://www.linkedin.com/posts/nlp-town_sentimentanalysis-camembert-xlm-activity-6605379961111007232-KJy3

That being said, I do think you are right, good catch!

Closing this b/c #4245 was merged

(we still need to investigate why the lowercasing is not properly populated by the model's config)
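
As an illustrative aside (assuming bert-base-uncased; this is not the actual change in #4245), the flag matters because an uncased vocabulary tokenizes cased input very differently when lowercasing is disabled:

from transformers import BertTokenizer

text = "Super Bowl 50 was an American football game."

# Same uncased vocabulary, once with and once without lowercasing.
lower = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
no_lower = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=False)

print(lower.tokenize(text))     # tokens that actually exist in the uncased vocab
print(no_lower.tokenize(text))  # capitalized words tend to fall back to [UNK] or odd subword splits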
