The error message is "Word-level token extraction failed on document. string index out of range". I have 10 single-page PDFs in a blob storage container that I am trying to train a model on. All of the PDFs use the same template, so I'm confused as to why some hit the error and others don't.
I have tried using cURL and Postman; both return errors for the same PDFs. How can I isolate what the issue is inside the PDFs that fail when attempting to create and train a model?
@BeigeBadger Thanks for the feedback. We are looking into the issue and will update you shortly.
@BeigeBadger Could you please confirm whether the error occurs with the Train API or the Analyze API? I have seen the same error with the Analyze API when the document's format differs from the format used in training.
@PatrickFarley Could you please let us know if there is a way to get the details of the error for a particular document?
@Rohit I get the error when using the Train API.
@NHaiby, this user appears to be having issues with PDF text extraction.
I get the same error when I use the Train API.
Here is the response I get:
Response status code: 200
Response body: {
'modelId': '9..',
'trainingDocuments': [{
'documentName': '1.pdf',
'pages': 1,
'errors': ['Page 1: Word-level token extraction failed on document. string index out of range'],
'status': 'failure'
I get the same response for all five documents in the blob storage.
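For anyone trying to see which PDFs fail, here is a minimal Python sketch that pulls the failing documents and their error messages out of a Train API response shaped like the one above. The helper name `failed_documents`, the placeholder `modelId`, and the second sample document (including its `'success'` status) are assumptions for illustration. Only the field names `trainingDocuments`, `documentName`, `errors`, and `status` come from the response shown.

```python
def failed_documents(response):
    """Return (document name, errors) pairs for documents that failed.

    A document counts as failed if its status is 'failure' or it
    carries a non-empty 'errors' list.
    """
    failures = []
    for doc in response.get("trainingDocuments", []):
        errors = doc.get("errors", [])
        if doc.get("status") == "failure" or errors:
            failures.append((doc.get("documentName"), errors))
    return failures


# Sample shaped like the response above. The modelId is a placeholder and
# the second document is hypothetical, added to show a non-failing entry.
sample = {
    "modelId": "example-model-id",
    "trainingDocuments": [
        {
            "documentName": "1.pdf",
            "pages": 1,
            "errors": [
                "Page 1: Word-level token extraction failed on document. "
                "string index out of range"
            ],
            "status": "failure",
        },
        {"documentName": "2.pdf", "pages": 1, "errors": [], "status": "success"},
    ],
}

for name, errors in failed_documents(sample):
    print(name, errors)
```

This only narrows down which files fail and on which page. It does not explain what inside the PDF triggers the extraction error.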
The "string index out of range" error should be fixed now. Please try running the data again.
@NHaiby
Hello, thanks for your feedback. My first test was successful!
Resolved