Hello everyone, I have a large number of documents and I need to extract specific information from them with a Hugging Face question answering model. The first issue I faced was that the documents were very large, so I got a token-length error. I then divided the data into small paragraphs and applied the model to those, but this time the answers were not accurate. So I just want to know: is there an alternative method or model for this?
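For reference, the paragraph-splitting step described above can be sketched like this (a minimal sketch; the chunk size, overlap value, and whitespace-based splitting are assumptions, not from the original post). Overlapping chunks help avoid cutting an answer in half at a chunk boundary, which is one common cause of inaccurate answers after splitting:

```python
def split_into_chunks(text, max_words=300, overlap=50):
    """Split text into overlapping word-based chunks so each chunk
    stays comfortably under the model's token limit."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# Each chunk is then passed to the QA model separately, and the answer
# with the highest model score across all chunks is kept.
```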
A link to original question on Stack Overflow:
How long is your document? You may want to try a Longformer model, which can handle sequences up to 4096 tokens. Here's a Longformer model fine-tuned for QA: https://huggingface.co/valhalla/longformer-base-4096-finetuned-squadv1
Also, take a look at https://github.com/deepset-ai/haystack. This might help you a lot.
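If it helps, usage would look roughly like this (a sketch, assuming the standard `transformers` `pipeline` API; the placeholder question and context are made up). A small helper first does a crude check of whether the text might fit in the 4096-token window:

```python
def roughly_fits(text, max_tokens=4096):
    """Crude pre-check: the whitespace word count is a lower bound on
    the subword token count, so if even that exceeds the limit, the
    text definitely won't fit in one window."""
    return len(text.split()) <= max_tokens

if __name__ == "__main__":
    # Imported lazily: downloads the model weights on first use.
    from transformers import pipeline

    qa = pipeline(
        "question-answering",
        model="valhalla/longformer-base-4096-finetuned-squadv1",
    )
    context = "..."   # your long document here
    question = "..."  # your question here
    if roughly_fits(context):
        print(qa(question=question, context=context))
```

Note this is only a lower-bound check: subword tokenizers usually produce more tokens than words, so a text that passes it can still be too long.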
I looked into it. That is a great help, thanks. Can we also decide the output length with these types of pretrained models?
These QA models aren't generative, so there's no output length to set.
@AishwaryaVerma - For QA the output is usually very short (only a couple of words). It is very rare that the answer from AutoModelForQuestionAnswering is longer than 3 or 4 words.
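To illustrate why the answers come out short: extractive QA models score every token as a possible answer start and as a possible answer end, and the answer is the best-scoring span between a start and an end, usually capped at a small maximum length. A toy version of that decoding step (made-up tokens and scores, pure Python) might look like:

```python
def decode_span(tokens, start_scores, end_scores, max_answer_len=8):
    """Pick the (start, end) token pair with the highest combined score,
    subject to start <= end and a maximum span length."""
    best = (float("-inf"), 0, 0)
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_answer_len, len(tokens))):
            score = s + end_scores[j]
            if score > best[0]:
                best = (score, i, j)
    _, i, j = best
    return " ".join(tokens[i:j + 1])

# Toy example: the model strongly scores "2019" as both start and end,
# so the extracted answer is a single token.
tokens = ["The", "model", "was", "released", "in", "2019", "by", "AllenAI"]
start_scores = [0.1, 0.0, 0.0, 0.2, 0.1, 2.5, 0.0, 0.3]
end_scores = [0.0, 0.1, 0.0, 0.1, 0.0, 2.0, 0.1, 0.5]
print(decode_span(tokens, start_scores, end_scores))  # → "2019"
```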
You might also want to take a look at: https://huggingface.co/allenai/longformer-large-4096-finetuned-triviaqa