I'm trying to reproduce the RAG-Sequence NQ score of 44.5 reported in Table 1 of the paper at https://arxiv.org/abs/2005.11401.
I used the command from the examples/rag README:
python examples/rag/eval_rag.py \
--model_name_or_path facebook/rag-sequence-nq \
--model_type rag_sequence \
--evaluation_set path/to/test.source \
--gold_data_path path/to/gold_data \
--predictions_path path/to/e2e_preds.txt \
--eval_mode e2e \
--gold_data_mode qa \
--n_docs 5 \
--print_predictions \
--recalculate
For --gold_data_path I used data.retriever.qas.nq-test from the DPR repo, which consists of 3610 question-answer pairs: https://github.com/facebookresearch/DPR/blob/master/data/download_data.py#L91-L97
For --evaluation_set, my understanding is that it should contain just the questions, so I extracted the question column from the qas.nq-test csv file (one-liner below).
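In case it helps anyone else: the DPR qas files are tab-separated despite the .csv extension, with the question in the first column and the answer list in the second, so the questions can be pulled out with cut. The paths here are placeholders matching the command above.

# Extract the question column (first tab-separated field) into the evaluation set file.
cut -f1 path/to/nq-test.qa.csv > path/to/test.source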
I tried the above command with --n_docs 5 and --n_docs 10, with the following results:
n_docs 5
INFO:__main__:F1: 49.67
INFO:__main__:EM: 42.58
n_docs 10
INFO:__main__:F1: 50.62
INFO:__main__:EM: 43.49
With --n_docs 10 the EM is still about a point below the score in the paper. What would be the proper setup to reproduce the number: a different pretrained checkpoint, a higher --n_docs, or a different test set?
Thanks in advance!
Gently pinging @ola13 here; she probably knows best which command to run to reproduce the eval results :-)
Hi @acslk, thanks for your post!
You should be able to reproduce the paper results for the RAG Token model (44.1 EM on NQ) by evaluating facebook/rag-token-nq with 20 docs, as in the command below.
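Concretely, that is the same command as above with the model name, model type, and number of retrieved docs swapped out (paths remain placeholders):

python examples/rag/eval_rag.py \
--model_name_or_path facebook/rag-token-nq \
--model_type rag_token \
--evaluation_set path/to/test.source \
--gold_data_path path/to/gold_data \
--predictions_path path/to/e2e_preds.txt \
--eval_mode e2e \
--gold_data_mode qa \
--n_docs 20 \
--print_predictions \
--recalculate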
As for the RAG Sequence model: we lost some quality when converting the checkpoint from fairseq (the experimentation framework we used to obtain the original paper results) to HuggingFace. We are now working on replicating the paper numbers in HF, and we'll update the official facebook/rag-sequence-nq model weights once we have that, so stay tuned!
Thanks for the response! I tried the command above with the RAG Token model and --n_docs 20 on the NQ test set, and I can confirm it matches the paper's results:
INFO:__main__:F1: 51.44
INFO:__main__:EM: 44.10