Hi all
I have trained BERT question answering on the SQuAD v1 dataset. Since I was using Colab, which was slow, I used only 5000 examples from SQuAD; training took 2 hours and gave an accuracy of 51%. My questions are:
1) I saved the pytorch_model.bin file after training. Can I use this new bin file to continue training on the next 5000 SQuAD examples? Should I replace the old pytorch_model.bin in the uncased folder with this new one? What steps do I need to follow?
2) I have custom question-answer data. To train on it, do I need to append it to the SQuAD dataset, or point the training data at the custom file only? How can I leverage the SQuAD-trained model to further train on the custom data?
3) Can anybody help me with a script to convert my data to the SQuAD format?
Detailed steps for leveraging the SQuAD-trained model and training on custom data on top of it would be appreciated.
Put the pytorch_model.bin file that was output from your fine-tuning on SQuAD in some other folder and set that folder as bert_model='path/to/this/folder'. The folder needs to have the files bert_config.json and vocab.txt from the first pretrained model you used, though.
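If it helps, here is a minimal sketch of loading the fine-tuned checkpoint back from such a folder. It assumes the pytorch_pretrained_bert package that run_squad.py is built on, and the folder path is just a placeholder:
from pytorch_pretrained_bert import BertForQuestionAnswering, BertTokenizer

# Placeholder path: the folder holding pytorch_model.bin, bert_config.json and vocab.txt
model_dir = 'path/to/this/folder'

# from_pretrained() picks up pytorch_model.bin and bert_config.json from the directory
model = BertForQuestionAnswering.from_pretrained(model_dir)
# the tokenizer is loaded from vocab.txt in the same directory
tokenizer = BertTokenizer.from_pretrained(model_dir, do_lower_case=True)
Passing that same folder as bert_model to run_squad.py, as described above, will then continue training from these weights. To see what the training data itself looks like, the SQuAD v1.1 file can be inspected like this: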
import json
input_file = 'train-v1.1.json'
with open(input_file, "r", encoding='utf-8') as reader:
    input_data = json.load(reader)["data"]
The input data, under the top-level "data" tag, holds "paragraphs" tags, which in turn hold texts in "context" tags, and questions and answers in "qas" tags. You can check the structure of the texts/questions/answers like this:
from pprint import pprint
pprint(input_data[0])
{'paragraphs': [{'context': 'Architecturally, the school has a Catholic '
"character. Atop the Main Building's gold dome is "
'a golden statue of the Virgin Mary. Immediately '
'in front of the Main Building and facing it, is a '
'copper statue of Christ with arms upraised with '
'the legend "Venite Ad Me Omnes". Next to the Main '
'Building is the Basilica of the Sacred Heart. '
'Immediately behind the basilica is the Grotto, a '
'Marian place of prayer and reflection. It is a '
'replica of the grotto at Lourdes, France where '
'the Virgin Mary reputedly appeared to Saint '
'Bernadette Soubirous in 1858. At the end of the '
'main drive (and in a direct line that connects '
'through 3 statues and the Gold Dome), is a '
'simple, modern stone statue of Mary.',
'qas': [{'answers': [{'answer_start': 515,
'text': 'Saint Bernadette Soubirous'}],
'id': '5733be284776f41900661182',
'question': 'To whom did the Virgin Mary allegedly '
'appear in 1858 in Lourdes France?'},
{'answers': [{'answer_start': 188,
'text': 'a copper statue of Christ'}],
'id': '5733be284776f4190066117f',
'question': 'What is in front of the Notre Dame Main '
'Building?'},
{'answers': [{'answer_start': 279,
'text': 'the Main Building'}],
'id': '5733be284776f41900661180',
'question': 'The Basilica of the Sacred heart at '
'Notre Dame is beside to which '
'structure?'},
{'answers': [{'answer_start': 381,
'text': 'a Marian place of prayer and '
'reflection'}],
'id': '5733be284776f41900661181',
'question': 'What is the Grotto at Notre Dame?'},
{'answers': [{'answer_start': 92,
'text': 'a golden statue of the Virgin '
'Mary'}],
'id': '5733be284776f4190066117e',
'question': 'What sits on top of the Main Building '
'at Notre Dame?'}]},
{'context': "As at most other universities, Notre Dame's ....
(many more context and qas tags are printed here)
The conversion from your custom data to this format depends on the current format of your data. But if you can create a python dict looking like this with your data, you can make a json file from it and use it as training data in the run_squad.py script.
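As a starting point, here is a rough sketch of such a conversion, assuming your data can be expressed as a list of (context, question, answer) triples. The function name make_squad_file and the variable names are my own, and answer_start is found by simply locating the answer string inside the context:
import json
import uuid

def make_squad_file(triples, out_path):
    """Write (context, question, answer) triples as a SQuAD v1.1-style JSON file."""
    paragraphs = []
    for context, question, answer in triples:
        answer_start = context.find(answer)  # assumes the answer occurs verbatim in the context
        if answer_start == -1:
            continue  # skip pairs whose answer cannot be located
        paragraphs.append({
            'context': context,
            'qas': [{
                'id': str(uuid.uuid4()),
                'question': question,
                'answers': [{'text': answer, 'answer_start': answer_start}],
            }],
        })
    squad = {'version': '1.1', 'data': [{'title': 'custom', 'paragraphs': paragraphs}]}
    with open(out_path, 'w', encoding='utf-8') as writer:
        json.dump(squad, writer, ensure_ascii=False)
Something like make_squad_file(my_triples, 'custom-train.json') should then give you a file you can pass as training data to run_squad.py, but do double-check the result against the structure printed above.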
@navdeep1604, @maxlund, or @thomwolf: Was custom training done and tested? We faced a few issues like:
Would anyone like to share their observations, whether they faced the same or different problems? I'm also curious to know what actions or tricks were used to fix these issues.
This might help you set up a QA system with custom data; it's built on top of this repo: https://github.com/cdqa-suite/cdQA
Hi @SandeepBhutani,
I faced a similar issue, since my custom training data (240 QA pairs) was very small.
Hi, for anyone who has made a custom QA dataset, how did you go about getting the start and end positions for the answers, or did you already have them easily accessible? I have a large dataset of questions with corresponding context given by people; however, I don't have the specific answers, as there can be many acceptable answers. My goal is to determine whether the context contains an answer to the question (similar to SQuAD 2.0). Preliminary results after fine-tuning on SQuAD 2.0 weren't great, so I wanted to add more examples. Any recommendations on how I could label my data in the correct format for, say, BERT, or would I need to crowdsource labels from a vendor?
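For reference, the labelling target I have in mind is the SQuAD 2.0 layout, where answerable questions carry an answer span and unanswerable ones are flagged. A rough sketch of one 'qas' entry of each kind (the ids are made up, and the answerable example reuses the Notre Dame context shown earlier in this thread):
# Sketch of SQuAD 2.0-style "qas" entries
qas = [
    {   # answerable: the answer text and its character offset in the context are given
        'id': 'example-answerable',
        'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
        'answers': [{'text': 'Saint Bernadette Soubirous', 'answer_start': 515}],
        'is_impossible': False,
    },
    {   # unanswerable: is_impossible is true and the answers list is empty
        'id': 'example-unanswerable',
        'question': 'How tall is the Main Building?',
        'answers': [],
        'is_impossible': True,
    },
]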
Hi @cformosa,
The QA system package mentioned above also has an annotation tool that can help you with that task:
https://github.com/cdqa-suite/cdQA-annotator
Thanks for the link @andrelmfarias. I was looking over it and it seems extremely useful. It will probably take a long time to generate a large corpus of training data, but nevertheless it seems quite helpful. Thanks!
Hi @andrelmfarias, thank you for sharing this great resource!
The cdQA-suite seems to cater to a specific kind of question answering, as described in your Medium article. To summarise, it looks for the answer to a question from a collection of documents -- all these documents most likely contain different kinds of information regarding a particular topic. For example, a system built using cdQA could contain 10 different documents regarding 10 different historical periods, and you could ask it a question about any of these time periods, and it would search for the relevant answer within these 10 documents.
However, suppose the system you want to build is as follows: you have 10 court orders, and you want to ask the system the same set of questions for each court order. For example:
In this case, I wouldn't want the system to search through every document, but instead look for answers within the given document itself, exactly like SQuAD 2.0.
My assessment is that I wouldn't be able to build such a system using cdQA but I could use the cdQA annotator to build my dataset. Is that a sound assessment?
Also, I'm curious to hear your thoughts on how feasible it would be to expect good results when the context is rather long (anywhere between 2-10 pages).
Thank you :)
Hi @rsomani95 ,
As your questions are particularly related to cdQA I opened an issue with your questions in our repository to avoid spamming here: https://github.com/cdqa-suite/cdQA/issues/275
I just answered them there.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.