Azure-docs: Form Recognizer returns error on some input files, even though they all use the same template

Created on 27 Jun 2019 · 8 Comments · Source: MicrosoftDocs/azure-docs

The error message is "Word-level token extraction failed on document. string index out of range". I have 10 single-page PDFs in a blob storage container that I am trying to train a model on. All of the PDFs use the same template, so I'm confused as to why some produce the error and others don't.

I have tried using both cURL and Postman, and both return errors for the same PDFs. How can I isolate what the issue is inside the PDFs that fail when attempting to create and train a model?

Document Details

⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.

ID: a5fa2bcb-4ecb-fea1-5126-36986c2c94e9
Version Independent ID: c6cd5bd0-b35b-5e29-7f5e-db8cabecc7af
Content: Quickstart: Train a model and extract form data by using cURL - Form Recognizer - Azure Cognitive Services
Content Source: articles/cognitive-services/form-recognizer/quickstarts/curl-train-extract.md
Service: cognitive-services
Sub-service: form-recognizer
GitHub Login: @PatrickFarley
Microsoft Alias: pafarley

Pri2 · assigned-to-author · cognitive-services/svc · form-recognizer/subsvc · product-question · triaged

Source

BeigeBadger

Most helpful comment

This error, "string index out of range", should be fixed now. Please try running the data again.

NHaiby on 18 Jul 2019

👍2

All 8 comments

@BeigeBadger Thanks for the feedback. We are investigating the issue and will update you shortly.

RohitMungi-MSFT on 27 Jun 2019

@BeigeBadger Could you please confirm whether the error occurs when using the Train API or the Analyze API? I have seen the same error with the Analyze API when the document's format differs from the one used in training.

@PatrickFarley Could you please let us know if there is a way to get the details of the error for a particular document?

RohitMungi-MSFT on 27 Jun 2019

@Rohit I get the error when using the Train API.

BeigeBadger on 27 Jun 2019

@NHaiby, this user appears to be having issues with PDF text extraction.

PatrickFarley on 27 Jun 2019

I get the same error when I use the Train API.

This is the response I get:

Response status code: 200
Response body: { 'modelId': '9..', 'trainingDocuments': [{ 'documentName': '1.pdf', 'pages': 1, 'errors': ['Page 1: Word-level token extraction failed on document. string index out of range'], 'status': 'failure'

I get the same response for all five documents in the blob storage container.
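
Because the Train API returns status code 200 even when individual documents fail, the per-document `status` and `errors` fields are the only place the failures show up. The following is a minimal Python sketch for listing which documents failed and why. It assumes the response body has already been decoded into a dictionary (for example with `response.json()` from the `requests` library) and that it has the shape shown in the response above. The function name `failed_documents` and the `2.pdf` entry in the sample are hypothetical and exist only for illustration.

```python
def failed_documents(train_response: dict) -> dict:
    """Map each failed document name to its list of error strings.

    Assumes the response shape shown in the thread: a 'trainingDocuments'
    list whose items carry 'documentName', 'errors', and 'status'.
    """
    failures = {}
    for doc in train_response.get("trainingDocuments", []):
        errors = doc.get("errors") or []
        # Treat a document as failed if it is marked 'failure' or reports errors.
        if doc.get("status") == "failure" or errors:
            failures[doc.get("documentName", "<unknown>")] = errors
    return failures


# Sample input: the first entry is copied from the thread (the model ID is
# truncated there as '9..'); the second entry is a hypothetical success.
sample = {
    "modelId": "9..",
    "trainingDocuments": [
        {
            "documentName": "1.pdf",
            "pages": 1,
            "errors": [
                "Page 1: Word-level token extraction failed on document. "
                "string index out of range"
            ],
            "status": "failure",
        },
        {"documentName": "2.pdf", "pages": 1, "errors": [], "status": "success"},
    ],
}

for name, errors in failed_documents(sample).items():
    print(name)
    for message in errors:
        print("  ", message)
```

This only narrows down which files and pages the service rejected. It does not explain what inside the PDF triggers the extraction failure, which in this thread turned out to be a service-side issue.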